Abstract

We study consistency of learning algorithms for a multi-class performance metric that is a 
non-decomposable function of the confusion matrix of a classifier and cannot be expressed as 
a sum of losses on individual data points; examples of such performance metrics include the 
micro and macro F-measure used widely in information retrieval and the multi-class G-mean 
metric popular in class-imbalanced problems. While there has been much work in recent years in
understanding the consistency properties of learning algorithms for ‘binary’ non-decomposable
metrics, little is known either about the form of the optimal classifier for a general
multi-class non-decomposable metric, or about how these learning algorithms generalize to the
multi-class case. In this paper, we provide a unified framework for analyzing a multi-class
non-decomposable performance metric, where the problem of finding the optimal classifier for the
performance metric is viewed as an optimization problem over the space of all confusion matrices
achievable under the given distribution. Using this framework, we show that (under a continuous
distribution) the optimal classifier for a multi-class performance metric can be obtained as
the solution of a cost-sensitive classification problem, thus generalizing several previous results
on specific binary non-decomposable metrics. We then design a consistent learning algorithm
for concave multi-class performance metrics that proceeds via a sequence of cost-sensitive
classification problems, and can be seen as applying the conditional gradient (CG) optimization
method over the space of feasible confusion matrices. To our knowledge, this is the first efficient 
learning algorithm (whose running time is polynomial in the number of classes) that is provably 
consistent for a large family of multi-class non-decomposable metrics. Our consistency result 
makes use of a novel proof technique based on the convergence analysis of the CG method. 


1 Introduction 

In many real-world classification tasks, the performance metric used to evaluate a multi-class
classifier is a non-decomposable function of the confusion matrix of the classifier and cannot be
expressed as a sum or expectation of losses on individual data points; this includes, for example,
the micro and macro F-measure used widely in information retrieval and the multi-class G-mean
metric popular in class-imbalanced problems (see Table 1 for other examples). While there has been
much work in recent years in understanding the consistency properties of plug-in or cost-sensitive
risk minimization based learning algorithms for ‘binary’ non-decomposable metrics [?],
little is known about the form of the optimal classifier for a general multi-class non-decomposable
metric, or about how these learning algorithms for binary performance metrics, which make use
of a brute-force line search over a single threshold/cost parameter, generalize to the multi-class case,
where the number of parameters that need to be tuned scales with the number of classes.

In this paper, we provide a general framework for analysing a multi-class non-decomposable
performance metric, where the problem of finding the optimal classifier for the performance metric
is viewed as an optimization problem over the space of all confusion matrices achievable under
the given distribution. Using this framework, we show that, under a continuous distribution, the
optimal classifier for any multi-class performance metric (that satisfies a mild condition) can be
obtained by solving a cost-sensitive classification problem, where the costs are given by the gradient
of the non-decomposable metric at the optimal confusion matrix. This result generalizes a previous
result for binary non-decomposable metrics [2] and also recovers several previous results on the
form of the optimal classifier for specific binary performance metrics.

A natural first-cut learning algorithm that arises from the above characterization is one that 
learns a plug-in classifier by applying an empirical weight matrix chosen by a brute-force search to 
a suitable class probability estimator. While this method can be shown to be statistically
consistent with respect to the given performance metric (under a continuous distribution), it becomes
computationally inefficient when the number of classes is large. As an alternative, we provide an 
efficient learning algorithm based on the conditional gradient (CG) optimization method (which 
we call the ‘BayesCG’ algorithm) that avoids a brute-force search over costs and can be seen as 
instead running the CG method over the space of feasible confusion matrices; the resulting
algorithm proceeds via a sequence of cost-sensitive classification problems, the solutions for which take
the form of plug-in classifiers. We show that the BayesCG algorithm is consistent for performance 
metrics that are concave functions of the confusion matrix; to the best of our knowledge, this is 
the first efficient learning algorithm (whose running time is polynomial in the number of classes) 
that is provably consistent for a large family of multi-class non-decomposable metrics. Also, unlike 
the brute-force plug-in method, the BayesCG algorithm requires no assumptions on the form of the 
optimal classifier for the given performance metric and hence on the underlying distribution. 

Our consistency result makes use of a novel proof technique based on the convergence analysis 
of the CG method [7]. More specifically, we show that the linear optimization step of the above 
CG method is solved approximately in the BayesCG algorithm and thus establish a regret bound 
for the algorithm for smooth concave performance metrics. For performance metrics that are
non-smooth concave functions of the confusion matrix, we prescribe applying the BayesCG algorithm to
a suitable smooth approximation of these performance metrics; we instantiate and show consistency
of this approach for concave performance metrics such as the G-mean, H-mean and Q-mean.

1.1 Related Work 

There have been several algorithms designed to optimize non-decomposable classification metrics,
particularly in the binary classification setting; these include the binary plug-in algorithm that
applies an empirical threshold to a class probability estimate [?], cost-sensitive risk minimization
based approaches [?], methods that optimize convex and non-convex approximations
to the given performance metric [?], and decision-theoretic methods that learn a
class probability estimate and compute predictions that maximize the expected value of the
performance metric on a test set [?]. Of these, the plug-in method is known to be consistent
for any binary performance metric for which the optimal classifier is threshold-based [2], while
the cost-sensitive approach is shown to be consistent for the class of fractional-linear performance
metrics [3]. There have also been results characterizing the optimal classifier for several binary
non-decomposable metrics [?], with the specific form of the classifier available in closed form for
fractional-linear metrics (i.e., metrics that are ratios of linear functions) [3].

We would also like to point out that there has been some work on designing algorithms for 
optimizing the F-measure in multi-label classification settings [?] and consistency
results for these methods [?], but these results do not apply to the setting considered in this
paper. In particular, while the multi-class performance metrics that we seek to optimize are
non-decomposable/non-additive over data points, the standard performance metrics of interest in a
multi-label setting can indeed be expressed as a sum of losses on individual examples, with each 
loss on an example potentially being a non-decomposable function of the labels on the example. 

Organization. We start with some preliminaries and background on non-decomposable
performance metrics in Section 2. In Section 3, we give a general framework for analysing multi-class
non-decomposable performance metrics and use this framework to derive the form of the optimal
classifier for a non-decomposable performance metric. Based on this characterization, we consider
a brute-force plug-in method for a multi-class non-decomposable metric in Section 4 and show that
this method is consistent. In Section 5, we design an alternate efficient learning algorithm based
on the conditional gradient optimization method, which we show is consistent for a large family of 
concave non-decomposable metrics. All proofs not in the main text are provided in the Appendix. 

2 Preliminaries and Background 

Notations. For any $n \in \mathbb{Z}_+$, we shall denote $[n] = \{1, \ldots, n\}$. For a predicate $\phi$, we shall denote
by $\mathbf{1}(\phi)$ the indicator function that takes value 1 if $\phi$ is true and 0 otherwise. The probability
simplex of dimension $n$ will be denoted by $\Delta_n = \{p \in \mathbb{R}^n_+ \mid \sum_{i=1}^n p_i = 1\}$. For a matrix $G \in \mathbb{R}^{n \times n}$,
we will use $g_y$ to denote the $y$-th column of the matrix, and shall refer to $\|G\|_1 = \sum_{i=1}^n \sum_{j=1}^n |G_{ij}|$
as the $\ell_1$ norm of $G$ and to $\|G\|_\infty = \max_{1 \le i,j \le n} |G_{ij}|$ as the $\ell_\infty$ norm of $G$; for any two matrices
$A, B \in \mathbb{R}^{n \times n}$, we shall denote their component-wise inner product as $\langle A, B \rangle = \sum_{i=1}^n \sum_{j=1}^n A_{ij} B_{ij}$.
For any set $\mathcal{C}$, we denote its closure under an appropriate metric space by $\overline{\mathcal{C}}$. For maximization over
integral sets, the notation argmax shall refer to ties being broken in favor of the larger number.

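Problem Setup. Let $\mathcal{X}$ be an instance space and $\mathcal{Y} = [n]$ be a set of class labels. We are given a
training sample $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathcal{X} \times [n])^m$ drawn i.i.d. according to an underlying
(unknown) probability distribution $D$ over $\mathcal{X} \times [n]$, and the goal in a multi-class classification
problem is to learn from these examples a prediction model $h_S : \mathcal{X} \to [n]$, which when given a new
instance $x \in \mathcal{X}$, makes a prediction $\hat{y} = h_S(x) \in [n]$. We will be interested in the more general
problem of learning from $S$ a randomized classifier $h_S : \mathcal{X} \to \Delta_n$ that for each instance outputs a
probability distribution over the labels in $[n]$; note that any deterministic classifier can be seen as a
randomized classifier whose output is always a vertex of the probability simplex $\Delta_n$. In particular,
we will consider settings where the performance of $h_S$ is evaluated using a non-decomposable
performance metric $\mathcal{P}_D : \Delta_n^{\mathcal{X}} \to \mathbb{R}_+$ that cannot be expressed as a sum or expectation of losses
on individual examples. We shall denote the marginal of $D$ over $\mathcal{X}$ as $D_{\mathcal{X}}$, the conditional class
probabilities for an instance $x$ as $\eta_y(x) = \mathbf{P}(Y = y \mid X = x)$, $\forall y \in [n]$, and the prior class probabilities
as $\pi_y = \mathbf{P}(Y = y)$, $\forall y \in [n]$; for a sample $S$, we shall use $D_S$ to denote the empirical distribution
which has its mass uniformly on the instances in $S$.

Table 1: Examples of performance metrics that are (continuous, bounded) functions of the confusion
matrix. For any classifier $h : \mathcal{X} \to \Delta_n$, we denote here $\mathrm{TPR}_y[h] = \mathbf{P}(h(X) = y \mid Y = y)$,
$\mathrm{Prec}_y[h] = \mathbf{P}(Y = y \mid h(X) = y)$, and $\pi_y = \mathbf{P}(Y = y)$. Each performance metric here can be expressed as
$\mathcal{P}^\psi_D[h] = \psi(C)$ with $C = \mathrm{conf}(h, D)$, where the form of $\psi : [0,1]^{n \times n} \to \mathbb{R}_+$ for a performance metric is given
in the fourth column; the last column provides important properties of $\psi$, all of which hold over
the set of feasible confusion matrices $\mathcal{C}_D$ (see Eq. (3)). Note that for any $C \in \mathcal{C}_D$, $\pi_y = \sum_{\hat{y}=1}^n C_{y,\hat{y}}$.

| Metric | Definition | Ref. | $\psi(C)$ | Properties |
|---|---|---|---|---|
| Accuracy | $\sum_{y=1}^n \pi_y \mathrm{TPR}_y$ | - | $\sum_{y=1}^n C_{y,y}$ | Linear |
| AM (1 - BER) | $\frac{1}{n}\sum_{y=1}^n \mathrm{TPR}_y$ | [?] | $\frac{1}{n}\sum_{y=1}^n \frac{C_{y,y}}{\sum_{\hat{y}=1}^n C_{y,\hat{y}}}$ | Linear |
| Binary $F_1$-metric | $\frac{2}{\frac{1}{\mathrm{Prec}_2} + \frac{1}{\mathrm{TPR}_2}}$ | [?] | $\frac{2C_{2,2}}{2C_{2,2} + C_{1,2} + C_{2,1}}$ | Non-concave, Pseudo-linear |
| Jaccard Coefficient (JAC) | $\frac{\pi_2 \mathrm{TPR}_2}{\pi_2 + \pi_1(1 - \mathrm{TPR}_1)}$ | [?] | $\frac{C_{2,2}}{C_{2,2} + C_{2,1} + C_{1,2}}$ | Non-concave, Pseudo-linear |
| AMS metric | - | [?] | $\sqrt{2\left((C_{1,2} + C_{2,2}) \log\left(1 + \frac{C_{2,2}}{C_{1,2}}\right) - C_{2,2}\right)}$ | Convex |
| Micro $F_1$-metric | - | [?] | $\frac{2\sum_{y=2}^n C_{y,y}}{2 - \sum_{y=1}^n C_{1,y} - \sum_{y=1}^n C_{y,1}}$ | Non-concave, Pseudo-linear |
| Macro $F_1$-metric | $\frac{1}{n}\sum_{y=1}^n \frac{2}{\frac{1}{\mathrm{Prec}_y} + \frac{1}{\mathrm{TPR}_y}}$ | [?] | $\frac{1}{n}\sum_{y=1}^n \frac{2C_{y,y}}{\sum_{\hat{y}=1}^n C_{y,\hat{y}} + \sum_{\hat{y}=1}^n C_{\hat{y},y}}$ | Non-concave |
| H-Mean (HM) | $n\left(\sum_{y=1}^n \frac{1}{\mathrm{TPR}_y}\right)^{-1}$ | [?] | $n\left(\sum_{y=1}^n \frac{\sum_{\hat{y}=1}^n C_{y,\hat{y}}}{C_{y,y}}\right)^{-1}$ | Concave, Non-smooth |
| Q-Mean (QM) | $1 - \sqrt{\frac{1}{n}\sum_{y=1}^n (1 - \mathrm{TPR}_y)^2}$ | [?] | $1 - \sqrt{\frac{1}{n}\sum_{y=1}^n \left(1 - \frac{C_{y,y}}{\sum_{\hat{y}=1}^n C_{y,\hat{y}}}\right)^2}$ | Concave, Non-smooth |
| G-Mean (GM) | $\left(\prod_{y=1}^n \mathrm{TPR}_y\right)^{1/n}$ | [29], [30] | $\left(\prod_{y=1}^n \frac{C_{y,y}}{\sum_{\hat{y}=1}^n C_{y,\hat{y}}}\right)^{1/n}$ | Concave, Non-smooth |
| Min-Max metric | $\min_{y \in [n]} \mathrm{TPR}_y$ | [?] | $\min_{y \in [n]} \frac{C_{y,y}}{\sum_{\hat{y}=1}^n C_{y,\hat{y}}}$ | Concave, Non-differentiable |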

Multi-class Non-decomposable Performance Metrics. Let us first define for a deterministic
classifier $h : \mathcal{X} \to [n]$ and distribution $D$, the confusion matrix $\mathrm{conf}(h, D) \in [0,1]^{n \times n}$ as

$$[\mathrm{conf}(h, D)]_{ij} = \mathbf{E}_{(X,Y) \sim D}\left[\mathbf{1}(Y = i,\, h(X) = j)\right], \quad \forall i, j \in [n];$$

the corresponding confusion matrix for a randomized classifier $h : \mathcal{X} \to \Delta_n$ is given by

$$[\mathrm{conf}(h, D)]_{ij} = \mathbf{E}_{(X,Y) \sim D}\left[h_j(X) \cdot \mathbf{1}(Y = i)\right], \quad \forall i, j \in [n].$$

In this paper, we shall be interested in non-decomposable performance metrics that can be expressed
as a continuous and bounded function $\psi : [0,1]^{n \times n} \to \mathbb{R}_+$ of the confusion matrix:

$$\mathcal{P}^\psi_D[h] = \psi(\mathrm{conf}(h, D)). \qquad (1)$$

For example, the macro $F_1$-measure used widely in text retrieval can be expressed as a function
$\psi^{F_1}(C) = \frac{1}{n}\sum_{y=1}^n \frac{2C_{y,y}}{\sum_{\hat{y}=1}^n C_{y,\hat{y}} + \sum_{\hat{y}=1}^n C_{\hat{y},y}}$ of the confusion matrix $C \in [0,1]^{n \times n}$. Table 1 contains
several examples of performance metrics that are functions of the confusion matrix.¹,² Throughout
this paper, we shall use the term performance metric to refer to both $\mathcal{P}^\psi_D$ and $\psi$.
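The definitions above can be made concrete with a small numerical sketch. The following Python code is illustrative only and is not taken from the paper: the function names (`confusion_matrix`, `tpr`, `g_mean`, `h_mean`, `macro_f1`) are hypothetical, labels are 0-based rather than the 1-based $[n]$ used in the text, and the confusion matrix is the empirical one computed from a finite sample.

```python
import numpy as np

def confusion_matrix(y_true, h_probs, n):
    """Empirical confusion matrix of a (possibly randomized) classifier.

    y_true : integer labels in {0, ..., n-1} (0-based, an assumption of this sketch).
    h_probs: (m, n) array; row k is the distribution h(x_k) over predicted labels.
    Entry (i, j) estimates E[h_j(X) * 1(Y = i)].
    """
    y_true = np.asarray(y_true)
    h_probs = np.asarray(h_probs, dtype=float)
    C = np.zeros((n, n))
    for i in range(n):
        C[i] = h_probs[y_true == i].sum(axis=0)
    return C / len(y_true)

def tpr(C):
    # TPR_y = C[y, y] / (sum of row y); the row sums are the class priors.
    return np.diag(C) / C.sum(axis=1)

def g_mean(C):
    return float(np.prod(tpr(C)) ** (1.0 / C.shape[0]))

def h_mean(C):
    return float(C.shape[0] / np.sum(1.0 / tpr(C)))

def macro_f1(C):
    return float(np.mean(2 * np.diag(C) / (C.sum(axis=1) + C.sum(axis=0))))

# A deterministic classifier is a randomized one with one-hot outputs.
C = confusion_matrix([0, 0, 1, 1], np.eye(2)[[0, 1, 1, 1]], n=2)
print(C, g_mean(C), h_mean(C), macro_f1(C))
```

None of these metric values can be written as an average of per-example losses; each depends on the sample only through the matrix `C`, which is the sense in which the metrics are non-decomposable.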


1 For all performance metrics considered in this paper, higher values indicate better performance. 

2 In the setting considered here, the goal is to maximize a performance metric that can be expressed as a
(non-decomposable) function of expectations; this is referred to by Ye et al. (2012) [?] as the expected utility maximization
setup and is different from the decision-theoretic setting that they consider, where one looks at the expectation of a
non-decomposable performance metric on $m$ examples, and seeks to maximize its limiting value as $m \to \infty$.

Algorithm 1 Plug-in Algorithm for Binary Non-decomposable Performance Metric

1: Input: $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathcal{X} \times [2])^m$, $\psi : [0,1]^{2 \times 2} \to \mathbb{R}_+$

2: Split $S$ into two sets $S'$ and $S''$ with sizes $m_1 = \lfloor (1 - \alpha)m \rfloor$ and $m_2 = \lceil \alpha m \rceil$.

3: Learn $\hat{\eta}_{S'} = \mathrm{CPE}(S')$, where $\mathrm{CPE} : \cup_{m=1}^{\infty} (\mathcal{X} \times [2])^m \to [0,1]^{\mathcal{X}}$ is a suitable CPE algorithm

4: $\hat{t}_S \in \mathrm{argmax}_{t \in [0,1]} \mathcal{P}^\psi_{D_{S''}}[h_t]$, where $h_t(x) = [1, 0]^\top$ if $\hat{\eta}_{S'}(x) < t$, and $h_t(x) = [0, 1]^\top$ otherwise

5: Output: $h_{\hat{t}_S}$

$\psi$-consistency. We now consider the optimal value of the performance metric $\mathcal{P}^\psi_D$ over all randomized
classifiers:

$$\mathcal{P}^{\psi,*}_D = \sup_{h : \mathcal{X} \to \Delta_n} \mathcal{P}^\psi_D[h],$$

and shall refer to the classifier attaining the above value, if one exists, as the $\psi$-optimal classifier.
One can then define the $\psi$-regret of a classifier $h$ as

$$\mathrm{regret}^\psi_D[h] = \mathcal{P}^{\psi,*}_D - \mathcal{P}^\psi_D[h].$$

A learning algorithm that takes a training sample $S$ drawn i.i.d. from $D^m$ and outputs a classifier
$h_S$ is said to be $\psi$-consistent if the $\psi$-regret of the classifier $h_S$ goes to zero in probability:

$$\mathrm{regret}^\psi_D[h_S] \xrightarrow{P} 0,$$

where the convergence in probability is over the random draw of $S$ from $D^m$.³

Optimal Classifier for Decomposable Metrics. While in general, it is not clear if there exists
a classifier that attains the optimal value of a given performance metric $\mathcal{P}^\psi_D[h] = \psi(\mathrm{conf}(h, D))$, it is
well-known that when $\psi$ is a linear function (i.e., $\mathcal{P}^\psi_D$ can be expressed as an expectation of a loss
on individual examples), a $\psi$-optimal classifier always exists. In particular, if $\psi$ takes the form

$$\psi^G(C) = \sum_{y=1}^n \sum_{\hat{y}=1}^n G_{y,\hat{y}}\, C_{y,\hat{y}} = \langle C, G \rangle,$$

for some matrix $G \in \mathbb{R}^{n \times n}$, then any classifier $h^* : \mathcal{X} \to \Delta_n$ that satisfies the following condition is
$\psi^G$-optimal:

$$h^*_i(x) > 0 \ \text{only if} \ i \in \mathrm{argmax}_{\hat{y} \in [n]}\, g_{\hat{y}}^\top \eta(x). \qquad (2)$$

It is seen that there always exists a deterministic classifier that satisfies the above condition. Also,
it is worth noting that maximizing the above performance metric is equivalent to solving a
cost-sensitive classification problem, with the costs given by the negative of the ‘gain’ matrix $G$.
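As a small illustration of condition (2), the sketch below builds a deterministic classifier from a gain matrix and a matrix of class probabilities. The function name `cost_sensitive_predict` and the 0-based labels are assumptions of this sketch. The score of label $j$ at $x$ is $g_j^\top \eta(x)$, where $g_j$ is the $j$-th column of $G$, and ties are broken in favor of the larger label, as in the notation section.

```python
import numpy as np

def cost_sensitive_predict(eta, G):
    """Deterministic classifier satisfying condition (2) for a gain matrix G.

    eta: (m, n) array, row k holding the class probabilities eta(x_k).
    G  : (n, n) gain matrix; column j is the vector g_j.
    Returns 0-based predicted labels.
    """
    scores = np.asarray(eta, dtype=float) @ np.asarray(G, dtype=float)
    n = scores.shape[1]
    # argmax on the column-reversed scores returns the last maximizer,
    # i.e. ties go to the larger label.
    return n - 1 - np.argmax(scores[:, ::-1], axis=1)

eta = np.array([[0.6, 0.4], [0.5, 0.5]])
print(cost_sensitive_predict(eta, np.eye(2)))                # identity gains: plain argmax
print(cost_sensitive_predict(eta, np.array([[1.0, 0.0],
                                            [0.0, 3.0]])))   # class 2 weighted three times
```

With the identity gain matrix the rule reduces to predicting the most probable class, which maximizes accuracy; other gain matrices shift the decision boundaries between classes.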

Plug-in Algorithm for Decomposable Metrics. A standard approach for maximizing a
decomposable metric (or equivalently solving a cost-sensitive classification problem) is the plug-in
method, where one first obtains a class probability estimation (CPE) model $\hat{\eta}_S : \mathcal{X} \to \Delta_n$ from the
given training sample $S$ and constructs a classifier $h_S(x) = \mathrm{argmax}_{\hat{y} \in [n]}\, g_{\hat{y}}^\top \hat{\eta}_S(x)$ for any instance
$x$. This approach can be shown to be $\psi^G$-consistent if the CPE algorithm used to learn $\hat{\eta}_S$ is such
that $\mathbf{E}_X\left[\|\hat{\eta}_S(X) - \eta(X)\|_1\right] \xrightarrow{P} 0$ [32] (which is indeed the case for any algorithm that performs
a regularized empirical risk minimization of a proper loss such as the logistic loss [?]).

3 We say $\phi(S)$ converges in probability to $a \in \mathbb{R}$, written as $\phi(S) \xrightarrow{P} a$, if $\forall \epsilon > 0$,
$\mathbf{P}_{S \sim D^m}(|\phi(S) - a| \ge \epsilon) \to 0$ as $m \to \infty$.

Known Results for Binary Non-decomposable Performance Metrics. We now summarize
what is understood about the optimal classifier for binary non-decomposable performance
metrics and about the consistency properties of learning algorithms for these metrics. It is known that,
under a continuous distribution, the optimal classifier for a binary monotonic non-decomposable
metric is obtained by placing a suitable threshold on the posterior class probability function [2].
For certain specific performance metrics, such as those that are fractional-linear/ratios of linear
functions (e.g., binary F-measure and JAC measure) [?], the geometric mean of precision
and recall [2], and the approximate median sign (AMS) metric [5], this characterization holds even
without the continuity assumption on the distribution; for some of these metrics, the exact form of
the threshold is also available in closed form [?]. It is also known that a plug-in algorithm that
constructs a classifier by assigning an empirical threshold to a suitable class probability estimate
(see Algorithm 1) is statistically consistent with respect to any binary non-decomposable metric
for which the optimal classifier is of the above thresholded form [?]; a similar result has also been
shown for a cost-sensitive risk minimization based approach for fractional-linear metrics [3].

While there has been a lot of work on binary non-decomposable metrics as seen above, little is 
known about how these results extend to the multi-class case. In particular, what is the form of 
the optimal classifier for a general multi-class non-decomposable metric? How do the plug-in and
cost-sensitive risk minimization based algorithms for binary performance metrics, which essentially
need to tune a single parameter, generalize to the multi-class case, where the number of parameters 
needed to be tuned grows with the number of classes? In this paper, we address these questions. 

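Before we proceed further, we will find it convenient to define for any given function $\mu : \mathcal{X} \to \mathbb{R}^n$,
the set of weighted argmax classifiers obtained by a gain matrix $G \in \mathbb{R}^{n \times n}$ on $\mu$:

$$\mathcal{H}_\mu = \left\{h : \mathcal{X} \to \Delta_n \mid \exists\, G \in \mathbb{R}^{n \times n} \ \text{s.t.} \ \forall x \in \mathcal{X},\ h_i(x) = 1 \ \text{if} \ i = \mathrm{argmax}_{\hat{y} \in [n]} [G\mu(x)]_{\hat{y}}\right\}.$$

Finally, a function $f : \mathbb{R}^{d \times d} \to \mathbb{R}$ is said to be $L$-Lipschitz w.r.t. the $\ell_1$ norm over $\mathcal{M} \subseteq \mathbb{R}^{d \times d}$, for
some $L > 0$, if

$$|f(M_1) - f(M_2)| \le L\|M_1 - M_2\|_1, \quad \forall M_1, M_2 \in \mathcal{M},$$

and is $\beta$-smooth w.r.t. the $\ell_1$ norm over $\mathcal{M} \subseteq \mathbb{R}^{d \times d}$, for some $\beta > 0$, if

$$\|\nabla f(M_1) - \nabla f(M_2)\|_\infty \le \beta\|M_1 - M_2\|_1, \quad \forall M_1, M_2 \in \mathcal{M}.$$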

3 Characterization of the Optimal Classifier for a General Multi-class
Performance Metric

We start by providing a generic framework for studying a multi-class non-decomposable
performance metric, where we view the problem of finding the optimal classifier for a non-decomposable
metric as an optimization problem over the space of all confusion matrices that are attainable under
the given distribution. Using this framework, we give a characterization of the optimal classifier
for a non-decomposable metric; in particular, we show that under a continuous distribution, the
optimal classifier for any multi-class non-decomposable performance metric (that satisfies a mild
condition) can be obtained by maximizing a decomposable performance metric, whose gain matrix
is given by the gradient of the non-decomposable metric at the optimal confusion matrix. To our
knowledge, this is the first such result for a general multi-class non-decomposable metric, generalizing a
previous result for binary non-decomposable metrics [2] and in addition also recovering previous
results on the form of the optimal classifier for several performance metrics [?].

Feasible confusion matrices. We begin by defining the set of feasible confusion matrices for a
distribution $D$ as the set of all confusion matrices achievable by a randomized classifier under $D$:

$$\mathcal{C}_D = \left\{C \in [0,1]^{n \times n} : C = \mathrm{conf}(h, D) \ \text{for some} \ h : \mathcal{X} \to \Delta_n\right\}. \qquad (3)$$

Note that every matrix $C \in \mathcal{C}_D$ is such that its row sums are equal to the prior probabilities, i.e.,
$\sum_{\hat{y}=1}^n C_{y,\hat{y}} = \pi_y$, $\forall y \in [n]$. It can be shown that this set is convex.

Proposition 1 (Convexity of $\mathcal{C}_D$). $\mathcal{C}_D$ is a convex set.
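The intuition behind Proposition 1 is that $\mathrm{conf}(h, D)$ is linear in $h$, so mixing two randomized classifiers pointwise mixes their confusion matrices. The sketch below checks this numerically on a toy distribution with a finite instance space; the finite instance space, the random seed, and all names here are assumptions made for illustration only and are not part of the paper's proof.

```python
import numpy as np

def conf(h, p_x, eta):
    """Population confusion matrix on a finite instance space.

    p_x: (k,) marginal over the k instances.
    eta: (k, n) array with eta[x, i] = P(Y = i | x).
    h  : (k, n) array whose rows are the distributions h(x).
    C[i, j] = sum_x p_x[x] * eta[x, i] * h[x, j].
    """
    return (eta * p_x[:, None]).T @ h

rng = np.random.default_rng(0)
k, n = 5, 3
p_x = rng.dirichlet(np.ones(k))
eta = rng.dirichlet(np.ones(n), size=k)
h1 = rng.dirichlet(np.ones(n), size=k)
h2 = rng.dirichlet(np.ones(n), size=k)
lam = 0.3

# The pointwise mixture of two randomized classifiers is again a randomized
# classifier, and its confusion matrix is the same mixture of the two matrices.
C_mix = conf(lam * h1 + (1 - lam) * h2, p_x, eta)
C_comb = lam * conf(h1, p_x, eta) + (1 - lam) * conf(h2, p_x, eta)
print(np.allclose(C_mix, C_comb))

# Row sums equal the class priors, whatever the classifier.
print(np.allclose(C_mix.sum(axis=1), p_x @ eta))
```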

The problem of finding the optimal classifier for the given performance metric can now be cast
as an optimization problem over $\mathcal{C}_D$; we shall shortly see that this viewpoint is useful in both
characterizing the optimal classifier for the performance metric and in designing consistent learning 
algorithms for the metric. 

We next make the following continuity assumption on D, which is essentially a multi-class extension 
of a similar assumption on D in [2] (in the binary label setting). 

Assumption A (Continuity of $D$). Let $U$ be a random variable distributed uniformly over the
simplex $\Delta_n$, and let $\nu$ be a base measure over $\Delta_n$ such that $\nu(A) = \mathbf{P}(U \in A)$, $\forall A \subseteq \Delta_n$. Let $\kappa$
denote the probability measure that is associated with the random variable $\eta(X)$. We will say that
a distribution $D$ satisfies Assumption A if $\kappa$ is absolutely continuous w.r.t. $\nu$.

We shall also make a mild assumption on $\psi$ that is satisfied by all performance metrics in Table 1
except the min-max metric.

Assumption B. We will say that $\psi : [0,1]^{n \times n} \to \mathbb{R}_+$ satisfies Assumption B w.r.t. distribution $D$
if it is continuous, differentiable and bounded over $\mathcal{C}_D$, and is strictly increasing in the diagonal
elements of its argument and non-increasing in the non-diagonal elements of its argument.

Under the above assumptions on $D$ and $\psi$, we now show that a $\psi$-optimal classifier always exists and
can be obtained by maximizing a decomposable performance metric constructed from the gradient
of $\psi$ at the optimal confusion matrix.

Theorem 2 (Characterization of $\psi$-optimal Classifier for a General Multi-class
Non-decomposable Metric Under Continuous Distributions). Let distribution $D$ satisfy Assumption
A, and $\psi : [0,1]^{n \times n} \to \mathbb{R}_+$ satisfy Assumption B w.r.t. $D$. Then there exists a classifier $h^* : \mathcal{X} \to \Delta_n$
that is $\psi$-optimal. Furthermore, for $G^* = \nabla\psi(\mathrm{conf}(h^*, D))$, we have

$$\emptyset \ne \mathrm{argmax}_{h : \mathcal{X} \to \Delta_n} \langle G^*, \mathrm{conf}(h, D) \rangle \subseteq \mathrm{argmax}_{h : \mathcal{X} \to \Delta_n} \psi(\mathrm{conf}(h, D)),$$

and thus any classifier $h : \mathcal{X} \to \Delta_n$ of the following form is $\psi$-optimal:

$$h_i(x) > 0 \ \text{only if} \ i \in \mathrm{argmax}_{\hat{y} \in [n]}\, {g^*_{\hat{y}}}^\top \eta(x).$$

The above theorem is a multi-class generalization of the result in [2] for binary monotonic
performance metrics, and in addition also gives the precise form of the optimal classifier for the
given performance metric. By a simple application of this theorem, we recover previous results
on the form of the optimal classifier for performance metrics that are fractional-linear [3] such as the
F-measure and Jaccard coefficient [6], and also for the AMS metric [5].

Before we prove Theorem 2, we will find it useful to state the following lemma.

Lemma 3 (Uniqueness of Optimal Confusion Matrix for Gain Matrices Obtained from
Gradients of $\psi$). Under the assumptions on $D$ and $\psi$ in Theorem 2, for any $C^* \in \overline{\mathcal{C}}_D$, we have

$$\mathrm{argmax}_{C \in \overline{\mathcal{C}}_D} \langle \nabla\psi(C^*), C \rangle = \mathrm{argmax}_{C \in \mathcal{C}_D} \langle \nabla\psi(C^*), C \rangle.$$

Moreover, the above set is a singleton.

The proof of Theorem 2 then follows from the first order necessary conditions for optimality of a
confusion matrix and the above result.

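Proof of Theorem 2. We shall first show that there exists a $\psi$-optimal classifier. By compactness
of $\overline{\mathcal{C}}_D$, we know that there exists $C^* \in \overline{\mathcal{C}}_D$ such that

$$\psi(C^*) = \max_{C \in \overline{\mathcal{C}}_D} \psi(C) = \sup_{C \in \mathcal{C}_D} \psi(C).$$

It remains to be shown that there exists a classifier that achieves this confusion matrix, i.e., $C^* \in \mathcal{C}_D$.
For this, we note from the first order necessary condition for optimality of $C^*$, given convexity of
$\mathcal{C}_D$ (see Proposition 1), that

$$\langle \nabla\psi(C^*), C^* \rangle \ge \langle \nabla\psi(C^*), C \rangle, \quad \forall C \in \overline{\mathcal{C}}_D. \qquad (4)$$

The above equation along with Lemma 3 implies that

$$\mathrm{argmax}_{C \in \overline{\mathcal{C}}_D} \langle \nabla\psi(C^*), C \rangle = \mathrm{argmax}_{C \in \mathcal{C}_D} \langle \nabla\psi(C^*), C \rangle = \{C^*\}.$$

Thus $C^* \in \mathcal{C}_D$ and hence there exists a classifier $h^* : \mathcal{X} \to \Delta_n$ such that $C^* = \mathrm{conf}(h^*, D)$. This
completes the proof of existence of a $\psi$-optimal classifier.

Next, for $G^* = \nabla\psi(C^*)$, we further have

$$\mathrm{argmax}_{h : \mathcal{X} \to \Delta_n} \langle G^*, \mathrm{conf}(h, D) \rangle = \{h : \mathcal{X} \to \Delta_n : \mathrm{conf}(h, D) = C^*\} \subseteq \mathrm{argmax}_{h : \mathcal{X} \to \Delta_n} \psi(\mathrm{conf}(h, D)).$$

Clearly, a classifier $h : \mathcal{X} \to \Delta_n$ that maximizes the linear performance metric $\langle G^*, \mathrm{conf}(\cdot, D) \rangle$ is also
$\psi$-optimal; as seen in Eq. (2), such a classifier takes the form given in the theorem statement. □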

Remark 1 (Necessity of continuity Assumption A on $D$). We note here that for the above
characterization to hold for a general non-decomposable performance metric, the continuity
assumption on the distribution $D$ (Assumption A) is indeed necessary. We illustrate this fact for the H-mean
performance metric by constructing a simple distribution that does not satisfy this assumption, and
where a classifier of the form in the theorem statement is not necessarily optimal. Consider the
following distribution $D$ over $\{x\} \times \{1, 2\}$ with $\eta(x) = [\frac{1}{2}, \frac{1}{2}]^\top$. It can be seen that the unique
optimal classifier for the H-mean performance metric is $h^*(x) = [\frac{1}{2}, \frac{1}{2}]^\top$, whose confusion matrix
$C^*$ and the gradient of $\psi$ at $C^*$ are given by:

$$C^* = \begin{bmatrix} \frac{1}{4} & \frac{1}{4} \\ \frac{1}{4} & \frac{1}{4} \end{bmatrix}; \qquad G^* = \nabla\psi(C^*) = \begin{bmatrix} \frac{1}{2} & -\frac{1}{2} \\ -\frac{1}{2} & \frac{1}{2} \end{bmatrix}.$$

Clearly, any classifier $h : \mathcal{X} \to \Delta_2$ will have $\langle G^*, \mathrm{conf}(h, D) \rangle = 0$; hence

$$\{h : \mathcal{X} \to \Delta_2\} = \mathrm{argmax}_{h : \mathcal{X} \to \Delta_2} \langle G^*, \mathrm{conf}(h, D) \rangle \supsetneq \mathrm{argmax}_{h : \mathcal{X} \to \Delta_2} \psi(\mathrm{conf}(h, D)) = \{h^*\}.$$

It is worth noting that for certain restricted families of performance metrics, the characterization
in Theorem 2 holds even without Assumption A on the distribution; this is the case, for example,
when $\psi$ is fractional-linear (e.g., F-measure, JAC) [3, ?] or is convex (e.g., AMS metric) [5].
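The single-point example in Remark 1 can be checked numerically. The sketch below is illustrative only: the helper names are hypothetical, the gradient is approximated by central finite differences rather than computed in closed form, and the H-mean is written as a function of the confusion matrix with $\mathrm{TPR}_y = C_{y,y} / \sum_{\hat{y}} C_{y,\hat{y}}$.

```python
import numpy as np

def h_mean(C):
    tpr = np.diag(C) / C.sum(axis=1)
    return C.shape[0] / np.sum(1.0 / tpr)

def numerical_gradient(psi, C, eps=1e-6):
    """Central finite-difference approximation of the gradient of psi at C."""
    G = np.zeros_like(C)
    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            E = np.zeros_like(C)
            E[i, j] = eps
            G[i, j] = (psi(C + E) - psi(C - E)) / (2 * eps)
    return G

def conf_single_point(p):
    # One instance x with eta(x) = [1/2, 1/2]; the classifier is h(x) = [p, 1 - p].
    return np.array([[0.5 * p, 0.5 * (1 - p)],
                     [0.5 * p, 0.5 * (1 - p)]])

C_star = conf_single_point(0.5)
G_star = numerical_gradient(h_mean, C_star)
print(G_star)  # approximately [[0.5, -0.5], [-0.5, 0.5]]

ps = np.linspace(0.05, 0.95, 19)
linear_vals = np.array([np.sum(G_star * conf_single_point(p)) for p in ps])
hmean_vals = np.array([h_mean(conf_single_point(p)) for p in ps])
print(np.max(np.abs(linear_vals)))  # close to 0: every classifier maximizes the linear metric
print(ps[np.argmax(hmean_vals)])    # the H-mean itself peaks at p = 0.5
```

On this distribution the H-mean of $h(x) = [p, 1-p]^\top$ equals $2p(1-p)$, so the linearized objective cannot distinguish the optimal classifier from any other one.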

Remark 2 (Extension to the min-max metric). A result similar to the one in Theorem 2 also
holds for the min-max metric, where it is well known from classical detection theory (in particular,
from min-max hypothesis testing) that the optimal classifier for this metric is obtained by
maximizing a decomposable metric with an appropriate gain matrix [?]. In fact, one can show that if $h^*$
is an optimal classifier for the min-max metric $\psi^{\mathrm{MM}}$, and $G^*$ is in the sub-differential of $\psi^{\mathrm{MM}}$ at
$\mathrm{conf}(h^*, D)$, then

$$\emptyset \ne \mathrm{argmax}_{h : \mathcal{X} \to \Delta_n} \langle G^*, \mathrm{conf}(h, D) \rangle \subseteq \mathrm{argmax}_{h : \mathcal{X} \to \Delta_n} \psi^{\mathrm{MM}}(\mathrm{conf}(h, D)).$$

4 A Consistent Plug-in Method for Multi-class Non-decomposable 
Metrics Based on a Brute-force Search 

Based on the above characterization of the optimal classifier of a non-decomposable metric, we now 
consider a simple plug-in based learning algorithm for a multi-class non-decomposable metric that 
uses a brute-force search over gain matrices; this approach can be seen as a natural extension of 
the binary plug-in method in Algorithm 1. We show that this method is consistent with respect
to a general non-decomposable metric, and also provide an explicit regret bound for this method 
for the special case of performance metrics that exhibit a certain convexity-like property. In the 
next section, we design an alternate efficient learning algorithm based on the conditional gradient 
algorithm which is consistent for a large family of non-decomposable metrics. 

Clearly, if the optimal confusion matrix $C^*$ for a multi-class non-decomposable metric $\psi$ is
known a priori, one can learn a simple plug-in classifier by applying the gradient of $\psi$ at $C^*$ to a
suitable class probability estimator. In the absence of knowledge of $C^*$, a natural first-cut approach
would be to perform a brute-force search over all gain matrices with bounded entries⁴ and pick
the one for which the resulting plug-in classifier yields maximum performance value on a held-out
part of the training set (see Algorithm 2). While for the binary case ($n = 2$), this brute-force
search essentially reduces to a search over thresholds (on the class probability estimate) that can
be performed efficiently in time linear in the number of held-out instances (as seen in Algorithm
1), for the general multi-class case, it is not clear if an exact search is tractable; in practice,
this maximization over gain matrices can be performed approximately by considering only a finite
number of matrices obtained from a fine-grained grid.

4 Since a plug-in classifier constructed from a gain matrix is invariant to scaling of entries of the matrix, it suffices
to perform the search over gain matrices with bounded entries.

Algorithm 2 Brute-force Plug-in Algorithm for Multi-class Non-decomposable Performance Metric

Input: $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathcal{X} \times [n])^m$, $\psi : [0,1]^{n \times n} \to \mathbb{R}_+$

Parameter: $\alpha \in (0, 1)$

Split $S$ into two sets $S'$ and $S''$ with sizes $m_1 = \lfloor (1 - \alpha)m \rfloor$ and $m_2 = \lceil \alpha m \rceil$.

Learn $\hat{\eta}_{S'} = \mathrm{CPE}(S')$, where $\mathrm{CPE} : \cup_{m=1}^{\infty} (\mathcal{X} \times [n])^m \to \Delta_n^{\mathcal{X}}$ is a suitable CPE algorithm

$\forall G \in [-1, 1]^{n \times n}$, define $h_G : \mathcal{X} \to \Delta_n$ such that $[h_G(x)]_i = 1$ if $i = \mathrm{argmax}_{\hat{y} \in [n]}\, g_{\hat{y}}^\top \hat{\eta}_{S'}(x)$

$\hat{G}_S \in \mathrm{argmax}_{G \in [-1,1]^{n \times n}} \mathcal{P}^\psi_{D_{S''}}[h_G]$

Output: $h_S = h_{\hat{G}_S}$
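The grid-based approximation of the brute-force search can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the class probability estimates on the held-out part are assumed to be given as an array, labels are 0-based, the grid resolution `levels` and all function names are hypothetical, and the G-mean is used as the example metric. The number of grid points is `levels ** (n * n)`, which makes the exponential cost of the approach in the number of classes explicit.

```python
import itertools
import numpy as np

def grid_gain_matrices(n, levels=3):
    """Yield every n x n matrix with entries on a uniform grid over [-1, 1]."""
    values = np.linspace(-1.0, 1.0, levels)
    for entries in itertools.product(values, repeat=n * n):
        yield np.array(entries).reshape(n, n)

def empirical_confusion(y, pred, n):
    C = np.zeros((n, n))
    np.add.at(C, (y, pred), 1.0)
    return C / len(y)

def g_mean(C):
    row = C.sum(axis=1)
    tpr = np.divide(np.diag(C), row, out=np.zeros(len(row)), where=row > 0)
    return float(np.prod(tpr) ** (1.0 / len(row)))

def brute_force_plugin(eta_hat, y, psi, candidates):
    """Return the candidate gain matrix with the best held-out value of psi."""
    eta_hat = np.asarray(eta_hat, dtype=float)
    y = np.asarray(y, dtype=int)
    n = eta_hat.shape[1]
    best_G, best_val = None, -np.inf
    for G in candidates:
        scores = eta_hat @ G  # scores[k, j] = g_j^T eta_hat(x_k)
        # ties go to the larger label, as in the notation section
        pred = n - 1 - np.argmax(scores[:, ::-1], axis=1)
        val = psi(empirical_confusion(y, pred, n))
        if val > best_val:
            best_G, best_val = G, val
    return best_G, best_val

# Imbalanced toy data: the plain argmax rule never predicts label 1.
eta_hat = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]])
y = np.array([0, 0, 0, 1])
G_best, val_best = brute_force_plugin(eta_hat, y, g_mean, grid_gain_matrices(2))
print(G_best, val_best)
```

On this toy sample the identity gain matrix gives a G-mean of 0, since the minority class is never predicted, while the search finds a gain matrix that separates the last instance from the rest.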

We now show that (under a continuous distribution) the brute-force plug-in method is statistically
consistent with respect to the given performance metric.

Theorem 4 (Consistency of Brute-force Plug-in Algorithm for Multi-class Non-decomposable
Metrics). Let $D$ satisfy Assumption A, and $\psi : [0,1]^{n \times n} \to \mathbb{R}_+$ satisfy Assumption B
w.r.t. $D$. If $h_S$ is the classifier learned by Algorithm 2 using training sample $S = (S', S'') \in
(\mathcal{X} \times [n])^m$ with parameter $\alpha \in (0, 1)$, and the CPE algorithm used in Algorithm 2 is such that
$\mathbf{E}_X\left[\|\hat{\eta}_{S'}(X) - \eta(X)\|_1\right] \xrightarrow{P} 0$, then $\mathrm{regret}^\psi_D[h_S] \xrightarrow{P} 0$ (as $m \to \infty$).

The above guarantee applies to all performance metrics in Table 1. Before we prove this result,
we state a couple of lemmas; in the first lemma, we consider a classifier obtained by applying a 
fixed gain matrix to a class probability estimation model, and show convergence of the entries of 
the confusion matrix for this classifier to those of a classifier obtained by applying the gain matrix 
to the true class probability function; in the second lemma, we give a uniform convergence bound 
for the confusion matrix of a set of weighted argmax classifiers. 

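Lemma 5 (Convergence of conf for fixed gain matrix). Let $D$ satisfy Assumption A. Let
$\hat{\eta}_S : \mathcal{X} \to \Delta_n$ be a class probability estimation model learned using a sample $S$ drawn i.i.d. from $D^m$.
For a fixed gain matrix $G \in [0,1]^{n \times n}$ such that no two columns are identical, let $h_G : \mathcal{X} \to \Delta_n$ and
$\hat{h}_G : \mathcal{X} \to \Delta_n$ be classifiers constructed as follows: $[h_G(x)]_i = 1$ if $i = \mathrm{argmax}_{\hat{y} \in [n]}\, g_{\hat{y}}^\top \eta(x)$, $\forall x \in \mathcal{X}$,
and $[\hat{h}_G(x)]_i = 1$ if $i = \mathrm{argmax}_{\hat{y} \in [n]}\, g_{\hat{y}}^\top \hat{\eta}_S(x)$, $\forall x \in \mathcal{X}$. If $\hat{\eta}_S$ is such that $\mathbf{E}_X\left[\|\hat{\eta}_S(X) - \eta(X)\|_1\right] \xrightarrow{P} 0$,
then $\forall i, j$, $[\mathrm{conf}(\hat{h}_G, D)]_{ij} \xrightarrow{P} [\mathrm{conf}(h_G, D)]_{ij}$ (as $m \to \infty$).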

Lemma 6 (Uniform Convergence Generalization Bound for conf Over $\mathcal{H}_\mu$). Let $\mu : \mathcal{X} \to \mathbb{R}^n$ be a fixed function and $S \in (\mathcal{X} \times [n])^m$ be a sample drawn i.i.d. according to $D^m$. For any $\delta \in (0,1]$, we have with probability at least $1 - \delta$ (over draw of $S$ from $D^m$),

$$\sup_{h \in \mathcal{H}_\mu} \big\|\operatorname{conf}(h, D) - \operatorname{conf}(h, \hat D_S)\big\|_\infty \;\le\; C\sqrt{\frac{n^2 \log(n)\log(m) + \log(n^2/\delta)}{m}},$$

where $C > 0$ is a distribution-independent constant.

We are now ready to prove Theorem 4.










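Proof of Theorem 4. By Theorem 2, a $\psi$-optimal classifier exists. Let $h^* : \mathcal{X} \to \Delta_n$ be one such classifier and let $G^* = \nabla\psi(\operatorname{conf}(h^*, D))$. Further, let $h_{G^*} : \mathcal{X} \to \Delta_n$ be a classifier such that $[h_{G^*}(x)]_i = 1$ if $i = \operatorname{argmax}_{y\in[n]} (g^*_y)^\top \eta(x)$; then again by Theorem 2, $\mathcal{P}^{\psi}_{D}[h_{G^*}] = \mathcal{P}^{\psi}_{D}[h^*]$. Also let $\hat h_{G^*} : \mathcal{X} \to \Delta_n$ be such that $[\hat h_{G^*}(x)]_i = 1$ if $i = \operatorname{argmax}_{y\in[n]} (g^*_y)^\top \hat\eta_{S'}(x)$. Thus,

$$
\begin{aligned}
\mathrm{regret}^{\psi}_{D}[\hat h_S]
&= \mathcal{P}^{\psi}_{D}[h^*] - \mathcal{P}^{\psi}_{D}[\hat h_S] \\
&= \mathcal{P}^{\psi}_{D}[h_{G^*}] - \mathcal{P}^{\psi}_{D}[\hat h_S] \\
&= \mathcal{P}^{\psi}_{D}[h_{G^*}] - \mathcal{P}^{\psi}_{D}[\hat h_{G^*}] + \mathcal{P}^{\psi}_{D}[\hat h_{G^*}] - \mathcal{P}^{\psi}_{\hat D_{S''}}[\hat h_{G^*}] + \mathcal{P}^{\psi}_{\hat D_{S''}}[\hat h_{G^*}] - \mathcal{P}^{\psi}_{D}[\hat h_S] \\
&\le \mathcal{P}^{\psi}_{D}[h_{G^*}] - \mathcal{P}^{\psi}_{D}[\hat h_{G^*}] + \mathcal{P}^{\psi}_{D}[\hat h_{G^*}] - \mathcal{P}^{\psi}_{\hat D_{S''}}[\hat h_{G^*}] + \mathcal{P}^{\psi}_{\hat D_{S''}}[\hat h_S] - \mathcal{P}^{\psi}_{D}[\hat h_S] \\
&\le \mathcal{P}^{\psi}_{D}[h_{G^*}] - \mathcal{P}^{\psi}_{D}[\hat h_{G^*}] + \sup_{h \in \mathcal{H}_{\hat\eta_{S'}}} \big(\mathcal{P}^{\psi}_{D}[h] - \mathcal{P}^{\psi}_{\hat D_{S''}}[h]\big) + \sup_{h \in \mathcal{H}_{\hat\eta_{S'}}} \big(\mathcal{P}^{\psi}_{\hat D_{S''}}[h] - \mathcal{P}^{\psi}_{D}[h]\big) \\
&\le \underbrace{\mathcal{P}^{\psi}_{D}[h_{G^*}] - \mathcal{P}^{\psi}_{D}[\hat h_{G^*}]}_{\text{term}_A} + \underbrace{2 \sup_{h \in \mathcal{H}_{\hat\eta_{S'}}} \big|\mathcal{P}^{\psi}_{D}[h] - \mathcal{P}^{\psi}_{\hat D_{S''}}[h]\big|}_{\text{term}_B},
\end{aligned}
$$

where the fourth step follows by definition of $\hat h_S$. By Assumption B on $\psi$, the matrix $G^*$ has no two identical columns, and hence by Lemma 5 we have that $\operatorname{conf}(\hat h_{G^*}, D)$ converges in probability to $\operatorname{conf}(h_{G^*}, D)$ as $m$ goes to $\infty$. Along with the continuity of $\psi$, this ensures that $\text{term}_A \xrightarrow{P} 0$. By suitably conditioning on $S'$ and using the uniform convergence bound in Lemma 6, one gets $\text{term}_B \xrightarrow{P} 0$. □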


For a special class of performance metrics that satisfy a certain convexity-like property, we have 
an explicit regret bound guarantee for the brute-force plug-in method. 

Theorem 7 (Regret Bound for Brute-force Plug-in Algorithm for Convex-like Non-decomposable Metrics). Let $D$ satisfy Assumption A, and $\psi : [0,1]^{n\times n} \to \mathbb{R}_+$ satisfy Assumption B w.r.t. $D$. Furthermore, let $\psi$ be $L$-Lipschitz w.r.t. the $\ell_1$ norm over $\mathcal{C}_D$, and be such that there exists $\xi > 0$ such that $\psi(C) - \psi(C') \le \xi\,\langle \nabla\psi(C),\, C - C' \rangle$, $\forall C, C' \in \mathcal{C}_D$. If $\hat h_S$ is the classifier learned by Algorithm 2 using training sample $S = (S', S'') \in (\mathcal{X} \times [n])^m$ with parameter $\alpha \in (0,1)$, then for any $\delta \in (0,1]$, we have with probability at least $1 - \delta$ (over draw of $S$ from $D^m$):

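$$\mathrm{regret}^{\psi}_{D}[\hat h_S] \;\le\; 2L\xi\,\mathbf{E}_X\big[\|\hat\eta_{S'}(X) - \eta(X)\|_1\big] + 2LC\sqrt{\frac{n^2\log(n)\log(\alpha m) + \log(n^2/\delta)}{\alpha m}},$$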
where C > 0 is a distribution-independent constant. 

The above result applies to several performance metrics including the AMS measure ($\xi = 1$) [5],
the binary F-measure ($\xi = 1/\pi_1$) [5] and the multi-class micro F-measure ($\xi = 1/(1 - \pi_1)$) [2]. The
proof of this theorem follows a similar progression as that of Theorem 4 and additionally makes
use of the convexity-like property of $\psi$ and the following regret bound for a linear/decomposable
performance metric defined using a bounded gain matrix.

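Lemma 8 (Regret Bound for Linear/Decomposable Performance Metric with Bounded Gain Matrix). Let $G \in [-L, L]^{n\times n}$ be a fixed gain matrix. Let $\hat\eta : \mathcal{X} \to \Delta_n$ be a class probability estimation model and $\hat h_G : \mathcal{X} \to \Delta_n$ be a classifier constructed such that $[\hat h_G(x)]_i = 1$ if $i = \operatorname{argmax}_{y\in[n]} g_y^\top \hat\eta(x)$. We then have

$$\max_{h : \mathcal{X} \to \Delta_n} \langle G, \operatorname{conf}(h, D)\rangle - \langle G, \operatorname{conf}(\hat h_G, D)\rangle \;\le\; 2L\,\mathbf{E}_X\big[\|\hat\eta(X) - \eta(X)\|_1\big].$$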
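The bound in Lemma 8 can be checked numerically on a finite instance space, where the optimal value of a linear metric is available in closed form. The sketch below is an illustration under assumptions: the instance space has 50 points with a uniform marginal, the true and estimated class probabilities are randomly generated, and `linear_metric_value` is a hypothetical helper, not a name from the paper.

```python
import numpy as np

def linear_metric_value(G, eta, marginal, preds):
    """<G, conf(h, D)> for a deterministic h on a finite space: sum_x mu(x) * g_{h(x)}^T eta(x)."""
    scores = eta @ G                       # scores[x, y] = g_y^T eta(x)
    return float(np.sum(marginal * scores[np.arange(len(preds)), preds]))

rng = np.random.default_rng(0)
n, num_x, L = 3, 50, 2.0
G = rng.uniform(-L, L, size=(n, n))                  # gain matrix with entries in [-L, L]
eta = rng.dirichlet(np.ones(n), size=num_x)          # true class probabilities eta(x)
marginal = np.full(num_x, 1.0 / num_x)               # uniform marginal over instances
noise = rng.dirichlet(np.ones(n), size=num_x)
eta_hat = 0.8 * eta + 0.2 * noise                    # a perturbed estimate, still in the simplex

# Optimal classifier uses the true eta; the plug-in classifier uses eta_hat.
best = linear_metric_value(G, eta, marginal, np.argmax(eta @ G, axis=1))
plug = linear_metric_value(G, eta, marginal, np.argmax(eta_hat @ G, axis=1))
gap = best - plug
bound = 2 * L * float(np.sum(marginal * np.abs(eta_hat - eta).sum(axis=1)))
```

Here `gap` is the left-hand side of the lemma and `bound` is the right-hand side; the maximum over randomized classifiers is attained by the deterministic argmax classifier, which is why `best` can be computed directly.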
Remark 3 (Connection to the method of Parambath et al. (2014) [4]). For certain
classes of performance metrics, the brute-force method in Algorithm 2 can be made more efficient
by considering, in the maximization step, only those gain matrices that are obtained from gradients of
$\psi$ at feasible confusion matrices in $\mathcal{C}_D$. This is beneficial, for example, in the case of fractional-linear
performance metrics such as the binary and micro F-measure, where any gradient obtained from a
feasible confusion matrix can be parametrized using a single scalar. The method of Parambath et
al. (2014), which makes use of this fact, can be seen as a special case of Algorithm 2.


5 A Consistent and Efficient Algorithm for Multi-class Non-decomposable 
Metrics Based on the Conditional Gradient Method 


While the (brute-force) plug-in method analyzed in the previous section is consistent for any non- 
decomposable metric for which the optimal classifier is of a certain desired form, the number of 
parameters that need to be tuned in this method grows with the number of classes n; in particular, 
the number of evaluations of the performance metric required in this method could be exponential 
in n. In this section, we provide an alternate efficient learning algorithm based on the conditional 
gradient (CG) optimization method and show that this algorithm is consistent for a large family 
of concave performance metrics. Also, unlike the brute-force plug-in, the CG based method makes 
no assumption on the form of the optimal classifier and hence on the underlying distribution. 

More specifically, we pose the problem of learning a classifier for a non-decomposable metric as a 
constrained optimization problem over the space of feasible confusion matrices, and explore the use 
of optimization methods for solving this problem. However, unlike a standard optimization problem 
where the constraint is explicitly specified, in the problem that we consider, testing feasibility of 
a confusion matrix is not tractable in general; this precludes the use of standard gradient descent 
based constrained optimization solvers for this problem. Instead, we make use of the conditional 
gradient (CG) method which does not require the constraint set to be explicitly specified, and 
instead only requires access to a linear optimization oracle over the constraint set [36] . In particular, 
this method proceeds via a sequence of linear optimization steps, each of which is equivalent to 
maximization of a decomposable performance metric and thus can be solved efficiently. 
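To make the role of the linear optimization oracle concrete, the following sketch runs a generic conditional gradient loop on a toy concave problem over the probability simplex, where the oracle simply returns a vertex. It is an illustration under assumptions, not the algorithm analyzed in this section: the objective, the target point, the step size $2/(j+1)$ and the names `conditional_gradient` and `simplex_lmo` are choices made for the example.

```python
import numpy as np

def conditional_gradient(grad, lmo, x0, num_iters):
    """Maximize a concave function using only its gradient and a linear maximization oracle."""
    x = np.asarray(x0, dtype=float)
    for j in range(1, num_iters + 1):
        s = lmo(grad(x))                  # s maximizes <grad(x), s> over the constraint set
        step = 2.0 / (j + 1)
        x = (1.0 - step) * x + step * s   # a convex combination, so x stays feasible
    return x

target = np.array([0.2, 0.5, 0.3])

def grad(x):
    """Gradient of the concave objective -||x - target||^2."""
    return -2.0 * (x - target)

def simplex_lmo(g):
    """Linear maximization over the probability simplex: pick the best vertex."""
    s = np.zeros_like(g)
    s[np.argmax(g)] = 1.0
    return s

x_T = conditional_gradient(grad, simplex_lmo, np.array([1.0, 0.0, 0.0]), 2000)
```

The loop never tests feasibility or projects onto the constraint set; every iterate is a convex combination of oracle outputs. In the learning setting the oracle is replaced by a cost-sensitive classifier.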

We first present an idealized version of the above CG based learning algorithm, where we
assume access to the underlying distribution $D$ (see Algorithm 3). Each iteration of this algorithm
maintains a classifier $h^j$ and (approximately) maximizes a decomposable performance metric given
by the gradient of $\psi$ at the confusion matrix for $h^j$. For a concave and smooth $\psi$, one can derive,
by extending the standard CG analysis, the following regret bound guarantee for this algorithm
[7]; we shall later see how this guarantee can be extended to non-smooth performance metrics.


Theorem 9 (Regret Bound for (Idealized) Conditional Gradient Algorithm for Concave Smooth Non-decomposable Metrics). Let $\psi : [0,1]^{n\times n} \to \mathbb{R}_+$ be concave over $\mathcal{C}_D$ and $\beta$-smooth w.r.t. the $\ell_1$-norm over $\mathcal{C}_D$. Let $h^{FW}$ be the classifier learned by Algorithm 3 with parameters $\kappa \in \mathbb{N}$ and $\epsilon > 0$. Then

$$\mathrm{regret}^{\psi}_{D}[h^{FW}] \;\le\; 2\epsilon + \frac{8\beta}{\kappa m + 2}.$$


Since in practice, one does not have access to $D$, we consider a sample-based version of Algorithm
3 where in each iteration, the gradient for the current classifier is computed using a sample-based
estimate of the confusion matrix of the classifier and the solution to the linear maximization step
Algorithm 3 Idealized Conditional Gradient Algorithm for Multi-class Non-decomposable Performance Metric.

Input: $D$, $\psi : [0,1]^{n\times n} \to \mathbb{R}_+$

Parameters: $\kappa \in \mathbb{N}$, $\epsilon > 0$

Choose an initial classifier $h^0 : \mathcal{X} \to \Delta_n$

$T = \kappa m$

for $j = 1$ to $T$ do

    $G^j = \nabla\psi(\operatorname{conf}(h^{j-1}, D))$

    Approximate Linear Maximization:

    Choose $u^j : \mathcal{X} \to \Delta_n$ such that $\langle G^j, \operatorname{conf}(u^j, D)\rangle \ge \max_{u : \mathcal{X} \to \Delta_n} \langle G^j, \operatorname{conf}(u, D)\rangle - \epsilon$

    Construct $h^j : \mathcal{X} \to \Delta_n$ such that $h^j(x) = \big(1 - \tfrac{2}{j+1}\big)h^{j-1}(x) + \tfrac{2}{j+1}u^j(x)$, $\forall x \in \mathcal{X}$

end for

Output: $h^{FW} = h^T$


is a plug-in classifier obtained from a suitable class probability estimation model (see Algorithm
4); we shall refer to this method as the 'BayesCG' algorithm. Clearly, this algorithm runs in time
polynomial in the number of classes $n$ and the number of training examples $m$.
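The sample-based procedure just described can be sketched in a few lines. This is a rough illustration under several assumptions rather than a faithful implementation: the class probability estimates on the held-out part of the sample are taken as given in `probs`, the step size is $2/(j+1)$, the initial classifier is uniformly random, and the smooth concave metric $\psi(C) = 1 - \frac{1}{n}\sum_y (1 - C_{y,y}/\pi_y)^2$ with fixed class priors $\pi$ is a stand-in chosen only because its gradient is easy to write down.

```python
import numpy as np

def confusion_soft(y, h, n):
    """Confusion matrix of a randomized classifier: h[k, j] = probability of predicting j on point k."""
    C = np.zeros((n, n))
    np.add.at(C, y, h)                     # adds row h[k] to row y[k] of C
    return C / len(y)

def plug_in(G, probs):
    """One-hot predictions of the classifier x -> argmax_y g_y^T eta_hat(x)."""
    u = np.zeros_like(probs)
    u[np.arange(len(probs)), np.argmax(probs @ G, axis=1)] = 1.0
    return u

def bayes_cg(probs, y, grad_psi, num_iters):
    """Returns the gain matrices, their ensemble weights, and the final classifier on the sample."""
    m, n = probs.shape
    h = np.full((m, n), 1.0 / n)           # initial randomized classifier h^0
    gains, weights = [], []
    for j in range(1, num_iters + 1):
        G = grad_psi(confusion_soft(y, h, n))       # gradient at the empirical confusion matrix
        step = 2.0 / (j + 1)
        h = (1.0 - step) * h + step * plug_in(G, probs)
        weights = [w * (1.0 - step) for w in weights] + [step]
        gains.append(G)
    return gains, np.array(weights), h

def make_grad(priors):
    """Gradient of the illustrative metric psi(C) = 1 - mean_y (1 - C[y, y] / priors[y])^2."""
    def grad_psi(C):
        return np.diag(2.0 * (1.0 - np.diag(C) / priors) / (len(priors) * priors))
    return grad_psi

probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.75, 0.25],
                  [0.7, 0.3], [0.6, 0.4], [0.55, 0.45]])
y = np.array([0, 0, 0, 0, 1, 1])
priors = np.bincount(y) / len(y)
gains, weights, h = bayes_cg(probs, y, make_grad(priors), num_iters=50)
```

The returned classifier is a weighted ensemble of plug-in classifiers, one per iteration: to predict on a new point, one would sample a gain matrix according to `weights` and apply `plug_in` with it. Each iteration costs one gradient evaluation and one pass over the sample, in line with the polynomial running time noted above.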

It is important to note that the BayesCG algorithm essentially mimics the earlier idealized 
algorithm, with the approximation factor $\epsilon$ in the linear maximization step now depending on
the input training sample. Using this observation and the above regret bound guarantee for the 
idealized algorithm, we now show that the BayesCG algorithm is consistent for any concave smooth 
performance metric. 

Theorem 10 (Consistency of Sample-based Conditional Gradient Algorithm for Concave Smooth Non-decomposable Metrics). Let $S = (S', S'') \in (\mathcal{X} \times [n])^m$ be the given training sample drawn i.i.d. from distribution $D$. Let $\psi : [0,1]^{n\times n} \to \mathbb{R}_+$ be concave over $\mathcal{C}_D$, $L$-Lipschitz over $\mathcal{C}_{\hat D_{S''}}$ and $\beta$-smooth, both w.r.t. the $\ell_1$ norm. Let $h^{FW}_S$ be the classifier learned by Algorithm 4 using training sample $S$ with parameters $\alpha \in (0,1)$ and $\kappa \in \mathbb{N}$. Then for any $\delta \in (0,1]$, we have with probability at least $1 - \delta$ (over draw of $S$ from $D^m$),

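$$\mathrm{regret}^{\psi}_{D}[h^{FW}_S] \;\le\; 4L\,\mathbf{E}_X\big[\|\hat\eta_{S'}(X) - \eta(X)\|_1\big] + 4\beta n^2 C\sqrt{\frac{n^2\log(n)\log(\alpha m) + \log(n^2/\delta)}{\alpha m}} + \frac{8\beta}{\kappa m + 2},$$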

where $C > 0$ is a distribution-independent constant. Thus, if the CPE algorithm used in Algorithm
4 is such that $\mathbf{E}_X\big[\|\hat\eta_{S'}(X) - \eta(X)\|_1\big] \xrightarrow{P} 0$, then $\mathrm{regret}^{\psi}_{D}[h^{FW}_S] \xrightarrow{P} 0$ (as $m \to \infty$).

A key element of the proof of the above theorem is in showing that the BayesCG algorithm solves
the CG linear maximization step approximately; this makes use of Lemma 8 and Lemma 6 in the
previous section (along with the smoothness assumption on $\psi$).

Lemma 11 (Approximation Factor for Linear Maximization Step in Algorithm 4). Let $\psi : [0,1]^{n\times n} \to \mathbb{R}_+$ satisfy the assumptions in Theorem 10. Let $u^j$ and $h^j$ be the classifiers constructed in any given iteration $j$ of Algorithm 4 using training sample $S = (S', S'') \in (\mathcal{X} \times [n])^m$ and parameter $\alpha \in (0,1)$. Also, let $G^j = \nabla\psi(\operatorname{conf}(h^{j-1}, D))$. Then for any $\delta \in (0,1]$, we have with probability at least $1 - \delta$ (over draw of $S$ from $D^m$) for all $1 \le j \le T$,

$$\langle G^j, \operatorname{conf}(u^j, D)\rangle \;\ge\; \max_{u : \mathcal{X} \to \Delta_n} \langle G^j, \operatorname{conf}(u, D)\rangle - \epsilon_S,$$
Algorithm 4 BayesCG: Sample-based Conditional Gradient Algorithm for Multi-class Non-decomposable Performance Metric.

Input: $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathcal{X} \times [n])^m$, $\psi : [0,1]^{n\times n} \to \mathbb{R}_+$

Parameters: $\alpha \in (0,1)$, $\kappa \in \mathbb{N}$

Split $S$ into two sets $S'$ and $S''$ with sizes $m_1 = \lceil (1-\alpha)m \rceil$ and $m_2 = \lfloor \alpha m \rfloor$.

Learn $\hat\eta_{S'} = \mathrm{CPE}(S')$, where $\mathrm{CPE} : \cup_{m=1}^{\infty} (\mathcal{X} \times [n])^m \to \Delta_n^{\mathcal{X}}$ is a suitable CPE algorithm

Choose an initial classifier $h^0 : \mathcal{X} \to \Delta_n$

$T = \kappa m$

for $j = 1$ to $T$ do

    $\hat G^j = \nabla\psi(\operatorname{conf}(h^{j-1}, \hat D_{S''}))$

    Approximate Linear Maximization:

    Construct $u^j : \mathcal{X} \to \Delta_n$ such that $[u^j(x)]_i = 1$ if $i = \operatorname{argmax}_{y\in[n]} (\hat g^j_y)^\top \hat\eta_{S'}(x)$, $\forall x \in \mathcal{X}$

    Construct $h^j : \mathcal{X} \to \Delta_n$ such that $h^j(x) = \big(1 - \tfrac{2}{j+1}\big)h^{j-1}(x) + \tfrac{2}{j+1}u^j(x)$, $\forall x \in \mathcal{X}$

end for

Output: $h^{FW}_S = h^T$


where

$$\epsilon_S = 2L\,\mathbf{E}_X\big[\|\hat\eta_{S'}(X) - \eta(X)\|_1\big] + 2\beta n^2 C\sqrt{\frac{n^2\log(n)\log(\alpha m) + \log(n^2/\delta)}{\alpha m}},$$

for a distribution-independent constant $C > 0$.

Proof of Theorem 10. The proof follows from Lemma 11 and the regret bound in Theorem 9. □


While the consistency result in Theorem 10 applies only to smooth performance metrics, for non-smooth
performance metrics such as the G-mean metric and several others in Table 1, one can apply
Algorithm 4 to a suitable smoothed version of the metric (indicated below by $\psi_\rho : [0,1]^{n\times n} \to \mathbb{R}_+$
for some $\rho \in (0,1)$, with $\lim_{\rho \to 0} \psi_\rho = \psi$), and obtain the following regret bound for the original
performance metric.

Theorem 12 (Regret Bound for Sample-based Conditional Gradient Algorithm for a Larger Family of Non-decomposable Metrics). Let $S = (S', S'') \in (\mathcal{X} \times [n])^m$ be the given training sample drawn i.i.d. from distribution $D$. Let $\psi : [0,1]^{n\times n} \to \mathbb{R}_+$ be such that for any $\rho \in (0,1)$, there exists $\psi_\rho : [0,1]^{n\times n} \to \mathbb{R}_+$ which is concave over $\mathcal{C}_D$, $L_\rho$-Lipschitz w.r.t. the $\ell_1$ norm over $\mathcal{C}_{\hat D_{S''}}$ and $\beta_\rho$-smooth w.r.t. the $\ell_1$-norm, with

$$\sup_{C \in \mathcal{C}_D} |\psi(C) - \psi_\rho(C)| \le \theta(\rho),$$

for some strictly increasing function $\theta : \mathbb{R}_+ \to \mathbb{R}_+$. Let $h^{FW,\rho}_S$ be the classifier learned by Algorithm 4 when applied to $\psi_\rho$ with training sample $S$ and parameters $\kappa \in \mathbb{N}$ and $\alpha \in (0,1)$. Then for any $\delta \in (0,1]$, we have with probability at least $1 - \delta$ (over draw of $S$ from $D^m$)


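$$\mathrm{regret}^{\psi}_{D}[h^{FW,\rho}_S] \;\le\; 4L_\rho\,\mathbf{E}_X\big[\|\hat\eta_{S'}(X) - \eta(X)\|_1\big] + 4\beta_\rho n^2 C\sqrt{\frac{n^2\log(n)\log(\alpha m) + \log(n^2/\delta)}{\alpha m}} + \frac{8\beta_\rho}{\kappa m + 2} + 2\theta(\rho),$$

where $C > 0$ is a distribution-independent constant.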
Table 2: Performance metrics $\mathcal{P}^{\psi}_{D}[h] = \psi(C)$, with $C = \operatorname{conf}(h, D)$, for which $\psi : [0,1]^{n\times n} \to \mathbb{R}_+$ is concave
but non-smooth (see Table 1 for the form of $\psi$ for these metrics). A smoothed version of this
function $\psi_\rho : [0,1]^{n\times n} \to \mathbb{R}_+$ for any $\rho \in (0,1)$ is given in the second column; in each case, $\psi_\rho$ is
also concave. The form of $\theta(\rho)$ (defined in Theorem 12), the Lipschitz constant $L_\rho$ and smoothness
parameter $\beta_\rho$ for the smoothed function are given respectively in the third, fourth and fifth columns.
Here, we denote $\pi_{\min} = \min_{y\in[n]} \pi_y$. Details of all calculations can be found in Appendix B.


metric 


MC) 


m 

L P 


Pp 

H-Mean (HM) 

3 

II 

Es=i C y,y+P V 
1 C y,y+P ) 

"I 

ir~p 

''min 

n 

P 


2 n 

7 1 

Q-Mean (QM) 


Cy 

^ y =1 

,y + P\ 2 
,y+Pj 

— ~/ = P 

^min \/ n 

1 1 
y/n P 

2 

y/n 

M 1+ p) 

G-Mean (GM) 

1=T 

« 3 

II 

( c y,y + P V 

\Sj = l Cy J +Pj J 

1 In 

\ 

2 p 1,n 

Ilfl+I) 1 - 1 /" 

n p\ p ) 

1 1 1 

(i+r 2/n 


5.1 Instantiation to specific concave multi-class performance metrics 

We now instantiate the regret bound in Theorem 12 to several performance metrics in Table 1 which
happen to be concave but non-smooth. Table 2 contains the smoothed version of these performance
metrics, along with the resulting Lipschitz and smoothness constants. We then have the following
consistency result for these metrics as a corollary of Theorem 12.
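The effect of smoothing can be illustrated numerically for the G-mean. The sketch below assumes the smoothed form $\psi_\rho(C) = \big(\prod_y \frac{C_{y,y} + \rho}{\sum_j C_{y,j} + \rho}\big)^{1/n}$, which is how we read the corresponding entry of Table 2; the confusion matrix and the function names are hypothetical and serve only to show the gap $|\psi(C) - \psi_\rho(C)|$ shrinking as $\rho \to 0$.

```python
import numpy as np

def g_mean(C):
    """G-mean: geometric mean of the per-class recalls C[y, y] / sum_j C[y, j]."""
    recalls = np.diag(C) / C.sum(axis=1)
    return float(np.prod(recalls) ** (1.0 / C.shape[0]))

def g_mean_smoothed(C, rho):
    """Assumed smoothed G-mean: rho is added to both numerator and denominator of each recall."""
    ratios = (np.diag(C) + rho) / (C.sum(axis=1) + rho)
    return float(np.prod(ratios) ** (1.0 / C.shape[0]))

# Hypothetical two-class confusion matrix (entries sum to one).
C = np.array([[0.5, 0.1],
              [0.1, 0.3]])
gaps = [abs(g_mean_smoothed(C, rho) - g_mean(C)) for rho in (1e-1, 1e-2, 1e-3)]
```

A smaller $\rho$ gives a closer approximation but, according to the table description, larger Lipschitz and smoothness constants, which is the trade-off that the choice of $\rho$ as a function of $m$ has to balance.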

Corollary 13 (Consistency of the Sample-based Conditional Gradient Algorithm for H-mean, Q-mean and G-mean). Let $S = (S', S'') \in (\mathcal{X} \times [n])^m$ be the given training sample drawn i.i.d. from distribution $D$. For each of H-mean, Q-mean and G-mean, let $\psi_\rho$ be chosen as prescribed in Table 2. Let $h^{FW,\rho}_S$ be the classifier learned by Algorithm 4 when applied to $\psi_\rho$ with training sample $S$. If the CPE algorithm used in Algorithm 4 is such that $\mathbf{E}_X\big[\|\hat\eta_{S'}(X) - \eta(X)\|_1\big] \xrightarrow{P} 0$, then for each of the above performance metrics, one can choose values of $\rho \to 0$ (as $m \to \infty$) so that $\mathrm{regret}^{\psi}_{D}[h^{FW,\rho}_S] \xrightarrow{P} 0$ (as $m \to \infty$).

Remark 4 (BayesCG is consistent for non-continuous distributions). While the consistency
guarantee for the brute-force plug-in method discussed in Section 4 makes crucial use of the form
of the optimal classifier for the given performance metric, requiring a continuity assumption on the
distribution (Assumption A), the BayesCG method requires no such assumption on the distribution.
In particular, when a distribution does not satisfy Assumption A, a randomized classifier can yield
a strictly higher performance value than the best deterministic classifier for the given performance
metric (e.g., for the distribution described in Remark 1, the randomized classifier $h^*$ yields a strictly
higher H-mean value than all deterministic classifiers). Since the brute-force plug-in algorithm
learns a deterministic classifier of a specific form, it fails to be consistent for such distributions.
On the other hand, the final classifier returned by the BayesCG algorithm is a randomized classifier
obtained from an ensemble of deterministic classifiers, where the size of this ensemble grows with the
number of training examples, thus enabling this method to handle a general distribution that does
not satisfy Assumption A.

Remark 5 (Extension to non-differentiable concave metrics). The consistency results that
we have seen so far for the BayesCG algorithm have assumed that the given performance metric
is concave and differentiable, and hence do not apply to the min-max metric in Table 1, which is
(concave, but) not differentiable. It is indeed possible to derive a version of the BayesCG algorithm
that is consistent for such continuous concave metrics, by working with a smooth differentiable
approximation to these performance metrics. The proof of consistency for the resulting learning
algorithm is however slightly more involved, requiring us to deal with approximate gradients to these
metrics, and is reserved for a longer version of this paper.

Remark 6 (Extension to fractional-linear metrics). We would also like to point out that there
is a variant of the CG method used in the BayesCG algorithm that can be applied to non-concave
optimization objectives [38], but this method can get stuck in a stationary point that is not a globally
optimal solution, and hence the resulting learning algorithm need not be consistent for a general
non-concave performance metric. However, one can show (without an explicit regret bound) that
this variant of the BayesCG algorithm is consistent for a special class of non-concave performance
metrics that are fractional-linear, such as the binary F-measure, the JAC metric and the multi-class
micro F-measure, where owing to the pseudo-linear structure of these performance metrics,
all stationary points are globally optimal solutions [39].

6 Conclusion 

We provide a unified framework for analysing a general non-decomposable multi-class performance
metric that cannot be expressed as a sum of losses on individual examples, such as the multi-class
F-measure and the multi-class G-mean metrics. Using this framework, we give a characterization
of the optimal classifier for a general non-decomposable performance metric, subsuming several
previous results on binary non-decomposable metrics. We then design an efficient learning algorithm
based on the conditional gradient (CG) optimization method that is consistent for a large family of 
concave performance metrics. Our proof techniques are novel and involve application of tools from 
the optimization literature, particularly those used in the convergence analysis of the CG method. 
A Proofs

A.1 Proof of Proposition 1

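Proposition (Convexity of $\mathcal{C}_D$). $\mathcal{C}_D$ is a convex set.

Proof. Let $C_1, C_2 \in \mathcal{C}_D$. Let $\lambda \in [0,1]$. We show that $\lambda C_1 + (1 - \lambda) C_2 \in \mathcal{C}_D$.

By definition of $\mathcal{C}_D$, there exist randomized classifiers $h_1, h_2 : \mathcal{X} \to \Delta_n$ such that

$$C_1 = \operatorname{conf}(h_1, D), \qquad C_2 = \operatorname{conf}(h_2, D).$$

Consider the randomized classifier $h_\lambda : \mathcal{X} \to \Delta_n$ defined as

$$h_\lambda(x) = \lambda h_1(x) + (1 - \lambda) h_2(x).$$

It can be seen that $\operatorname{conf}(h_\lambda, D) = \lambda C_1 + (1 - \lambda) C_2$, since each entry of the confusion matrix is linear in the classifier; hence $\lambda C_1 + (1 - \lambda) C_2 \in \mathcal{C}_D$. □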
A.2 Proof of Lemma 3

While Lemma 3 is simple to state, its proof is rather intricate, and hence we give it via several
intermediate lemmas and propositions.

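Lemma 14 (Confusion matrix as an integration). Let $f : \Delta_n \to \Delta_n$. Then

$$\operatorname{conf}(f \circ \eta, D) = \int_{p \in \Delta_n} p\,(f(p))^\top \, d\nu(p).$$

Proof.

$$
\begin{aligned}
[\operatorname{conf}(f \circ \eta, D)]_{i,j}
&= \mathbf{E}_{(X,Y)\sim D}\big[f_j(\eta(X)) \cdot \mathbf{1}(Y = i)\big] \\
&= \mathbf{E}_{p \sim \nu}\, \mathbf{E}_{(X,Y)\sim D}\big[f_j(p) \cdot \mathbf{1}(Y = i) \,\big|\, \eta(X) = p\big] \\
&= \mathbf{E}_{p \sim \nu}\big[p_i f_j(p)\big].
\end{aligned}
$$

□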
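The identity in Lemma 14 can be sanity-checked on a finite instance space, where $\nu$ is simply the distribution of $\eta(X)$ obtained by pooling instances that share the same class probability vector. The following sketch is an illustration under assumptions: the four-point instance space, its marginal, and the map `f` are invented for the example.

```python
import numpy as np

def conf_direct(marginal, eta, f):
    """[conf]_{ij} = sum_x mu(x) * eta_i(x) * f_j(eta(x)), summing over instances."""
    return sum(mu * np.outer(p, f(p)) for mu, p in zip(marginal, eta))

def conf_via_nu(marginal, eta, f):
    """The same matrix as an integral of p f(p)^T against nu, the law of eta(X)."""
    nu = {}
    for mu, p in zip(marginal, eta):
        key = tuple(p)                      # instances with equal eta(x) are pooled
        nu[key] = nu.get(key, 0.0) + mu
    return sum(mass * np.outer(np.array(p), f(np.array(p))) for p, mass in nu.items())

def f(p):
    """An arbitrary map from the simplex to itself (hypothetical choice)."""
    q = p ** 2
    return q / q.sum()

eta = np.array([[0.7, 0.3], [0.7, 0.3], [0.2, 0.8], [0.5, 0.5]])
marginal = np.array([0.1, 0.4, 0.3, 0.2])
C_direct = conf_direct(marginal, eta, f)
C_nu = conf_via_nu(marginal, eta, f)
```

Both computations give the same matrix, which reflects that a classifier of the form $f \circ \eta$ depends on $x$ only through $\eta(x)$.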


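Proposition 15 (Sufficiency of conditional probability). Let $D$ be a distribution over $\mathcal{X} \times \mathcal{Y}$. For any randomized classifier $h : \mathcal{X} \to \Delta_n$ there exists another randomized classifier $h' : \mathcal{X} \to \Delta_n$ such that $\operatorname{conf}(h, D) = \operatorname{conf}(h', D)$ and $h'$ is such that $h'(x) = f(\eta(x))$, for some $f : \Delta_n \to \Delta_n$.

Proof. Let $h : \mathcal{X} \to \Delta_n$. Define $f : \Delta_n \to \Delta_n$ as follows,

$$f(p) = \mathbf{E}_X\big[h(X) \,\big|\, \eta(X) = p\big].$$

We then have for any $i, j \in [n]$ that,

$$
\begin{aligned}
[\operatorname{conf}(h, D)]_{i,j}
&= \mathbf{E}_{(X,Y)\sim D}\big[h_j(X) \cdot \mathbf{1}(Y = i)\big] \\
&= \mathbf{E}_{p\sim\nu}\, \mathbf{E}_{(X,Y)\sim D}\big[h_j(X) \cdot \mathbf{1}(Y = i) \,\big|\, \eta(X) = p\big] \\
&= \mathbf{E}_{p\sim\nu}\Big[\mathbf{E}_{(X,Y)\sim D}\big[h_j(X) \,\big|\, \eta(X) = p\big] \cdot \mathbf{E}_{(X,Y)\sim D}\big[\mathbf{1}(Y = i) \,\big|\, \eta(X) = p\big]\Big] \\
&= \mathbf{E}_{p\sim\nu}\big[f_j(p)\, p_i\big] \\
&= [\operatorname{conf}(f \circ \eta, D)]_{i,j},
\end{aligned}
$$

where the third equality follows because, given $\eta(X)$, the random variables $X$ and $Y$ are independent. □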


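Lemma 16 (Continuity of the conf mapping). Let $D$ be a distribution over $\mathcal{X} \times \mathcal{Y}$. Let $f_1, f_2 : \Delta_n \to \Delta_n$. Then

$$\|\operatorname{conf}(f_1 \circ \eta, D) - \operatorname{conf}(f_2 \circ \eta, D)\|_1 \;\le\; \int_{p \in \Delta_n} \|f_1(p) - f_2(p)\|_1 \, d\nu(p).$$

Proof. Let $f_1, f_2 : \Delta_n \to \Delta_n$. By Lemma 14,

$$
\begin{aligned}
\operatorname{conf}(f_1 \circ \eta, D) - \operatorname{conf}(f_2 \circ \eta, D)
&= \int_{p \in \Delta_n} p\,(f_1(p) - f_2(p))^\top \, d\nu(p), \\
\|\operatorname{conf}(f_1 \circ \eta, D) - \operatorname{conf}(f_2 \circ \eta, D)\|_1
&\le \int_{p \in \Delta_n} \big\|p\,(f_1(p) - f_2(p))^\top\big\|_1 \, d\nu(p) \\
&= \int_{p \in \Delta_n} \|p\|_1\, \|f_1(p) - f_2(p)\|_1 \, d\nu(p) \\
&= \int_{p \in \Delta_n} \|f_1(p) - f_2(p)\|_1 \, d\nu(p).
\end{aligned}
$$

□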


Lemma 17. Let $d > 0$ be any integer. Let $\mathcal{V} \subset \mathbb{R}^d$ be compact and convex. Let $f : \mathbb{R}^d \to \mathbb{R}$ be an affine function such that it is non-constant over $\mathcal{V}$. Let $V$ be a vector valued random variable taking values uniformly over $\mathcal{V}$. There exists a constant $\alpha > 0$ such that for all $c \in \mathbb{R}$ and $\epsilon \in \mathbb{R}_+$ we have

$$\mathbf{P}\big(f(V) \in [c, c + \epsilon]\big) \le \alpha\epsilon.$$

Proof. Let us assume for now that the affine hull of $\mathcal{V}$ is the entire space $\mathbb{R}^d$.

For any integer $i$ and set $\mathcal{A}$, let $\mathrm{vol}_i(\mathcal{A})$ denote the $i$-th dimensional volume of the set $\mathcal{A}$. Note that $\mathrm{vol}_i(\mathcal{A})$ is undefined if the affine-hull dimension of $\mathcal{A}$ is greater than $i$ and is equal to zero if the affine-hull dimension of $\mathcal{A}$ is less than $i$.

For any $r > 0$ and any integer $i > 0$, let $B_i(r) \subset \mathbb{R}^i$ denote the set $B_i(r) = \{x \in \mathbb{R}^i : \|x\|_2 \le r\}$. Also let $R$ be the smallest value such that $\mathcal{V} \subset B_d(R)$.

Let the affine function $f$ be such that for all $x \in \mathbb{R}^d$, the value $f(x) = g^\top x + u$. By the assumption of non-constancy of $f$ on $\mathcal{V}$ we have that $g \neq 0$.

We now have that

$$
\begin{aligned}
\mathbf{P}\big(f(V) \in [c, c + \epsilon]\big)
&= \frac{\mathrm{vol}_d\big(\{v \in \mathcal{V} : c - u \le g^\top v \le c - u + \epsilon\}\big)}{\mathrm{vol}_d(\mathcal{V})} \\
&\le \frac{\mathrm{vol}_d\big(\{v \in B_d(R) : c - u \le g^\top v \le c - u + \epsilon\}\big)}{\mathrm{vol}_d(\mathcal{V})} \\
&\le \frac{\epsilon \cdot \mathrm{vol}_{d-1}(B_{d-1}(R))}{\mathrm{vol}_d(\mathcal{V})\,\|g\|_2}.
\end{aligned}
$$

The last inequality follows from the observation that the $d$-volume of a strip of a $d$-dimensional sphere of radius $r$ is at most the $(d-1)$-volume of a $(d-1)$-dimensional sphere of radius $r$ times the width of the strip, and the width of the strip under consideration here is simply $\epsilon / \|g\|_2$.

Finally, if the affine hull of $\mathcal{V}$ is not the entire space $\mathbb{R}^d$, one can simply consider the affine hull of $\mathcal{V}$ to be the entire space, and all the above arguments hold with some affine transformations and a smaller $d$. □


Lemma 18. Let $D$ be a distribution over $\mathcal{X} \times \mathcal{Y}$. Let $G \in \mathbb{R}^{n\times n}$ be such that no two columns are identical. Let the measure over conditional probabilities $\nu$ be absolutely continuous w.r.t. the base measure $\mu$. Let $c \ge 0$. Let $\mathcal{A}_c \subseteq \Delta_n$ be the set

$$\mathcal{A}_c = \big\{p \in \Delta_n : (p^\top G)_{(1)} - (p^\top G)_{(2)} \le c\big\},$$

where for any vector $v \in \mathbb{R}^n$ and integer $i \in [n]$, the scalar $(v)_{(i)}$ denotes the $i$-th element among the components of $v$, when they are arranged in descending order. Let $r : \mathbb{R}_+ \to \mathbb{R}_+$ be the function defined as

$$r(c) = \nu(\mathcal{A}_c).$$

Then

(a) $r$ is a monotonically increasing function.

(b) There exists a $C > 0$ such that $r$ is a continuous function over $[0, C]$.

(c) $r(0) = 0$.

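Proof. Part (a):

The fact that $r$ is a monotonically increasing function is immediate from the observation that $\mathcal{A}_a \subseteq \mathcal{A}_b$ for any $a < b$.

Part (b):

Let

$$C = \tfrac{1}{2}\min\big\{d > 0 : g_y - g_{y'} = d\,e \text{ for some } y, y' \in [n],\ y \neq y'\big\},$$

where $e$ is the all-ones vector. If there exist no $y, y'$ such that $g_y - g_{y'}$ is a scalar multiple of $e$, then we simply set $C = \infty$. Note that by our assumption of unequal columns of $G$, we always have $C > 0$.

For any $c \ge 0$ and $y, y' \in [n]$ with $y \neq y'$, define the set $\mathcal{A}^{y,y'}_c$ as

$$\mathcal{A}^{y,y'}_c = \big\{p \in \Delta_n : p^\top g_y - p^\top g_{y'} \le c\big\}.$$

For any $c \ge 0$ and $\epsilon > 0$, it can be clearly seen that

$$
\begin{aligned}
\nu(\mathcal{A}_{c+\epsilon}) - \nu(\mathcal{A}_c) &= \nu(\mathcal{A}_{c+\epsilon} \setminus \mathcal{A}_c), \\
\mathcal{A}_{c+\epsilon} \setminus \mathcal{A}_c &\subseteq \bigcup_{y, y' \in [n],\, y \neq y'} \big(\mathcal{A}^{y,y'}_{c+\epsilon} \setminus \mathcal{A}^{y,y'}_c\big), \\
\nu(\mathcal{A}_{c+\epsilon} \setminus \mathcal{A}_c) &\le \sum_{y, y' \in [n],\, y \neq y'} \nu\big(\mathcal{A}^{y,y'}_{c+\epsilon} \setminus \mathcal{A}^{y,y'}_c\big).
\end{aligned}
$$

Hence, our proof of continuity of $r$ would be complete if we show that $\nu\big(\mathcal{A}^{y,y'}_{c+\epsilon} \setminus \mathcal{A}^{y,y'}_c\big)$ goes to zero as $\epsilon$ goes to zero, for all $y \neq y'$ and $c \in [0, C]$.

Let $c \in [0, C]$ and $y, y' \in [n]$ with $y \neq y'$. Then

$$\mathcal{A}^{y,y'}_{c+\epsilon} \setminus \mathcal{A}^{y,y'}_c = \big\{p \in \Delta_n : c < p^\top(g_y - g_{y'}) \le c + \epsilon\big\}.$$

If $g_y - g_{y'} = d\,e$ for some $d$, we have that $p^\top(g_y - g_{y'}) = d$ and $d > C$ by definition of $C$. Hence for small enough $\epsilon$ the set $\mathcal{A}^{y,y'}_{c+\epsilon} \setminus \mathcal{A}^{y,y'}_c$ is empty.

If $g_y - g_{y'}$ is not a scalar multiple of $e$, then $p^\top(g_y - g_{y'})$ is a non-constant linear function of $p$ over $\Delta_n$. From Lemma 17, $\mu\big(\mathcal{A}^{y,y'}_{c+\epsilon} \setminus \mathcal{A}^{y,y'}_c\big)$ goes to zero as $\epsilon$ goes to zero. And by the absolute continuity of $\nu$ w.r.t. $\mu$, we have that $\nu\big(\mathcal{A}^{y,y'}_{c+\epsilon} \setminus \mathcal{A}^{y,y'}_c\big)$ goes to zero as $\epsilon$ goes to zero.

As the above arguments hold for any $c \in [0, C]$ and $y, y' \in [n]$ with $y \neq y'$, the proof of part (b) is complete.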

Part (c):

We have,

$$\mathcal{A}_0 \subseteq \bigcup_{y, y' \in [n],\, y \neq y'} \big(\mathcal{A}^{y,y'}_0 \cap \mathcal{A}^{y',y}_0\big).$$

To show $r(0) = 0$, we show $\nu\big(\mathcal{A}^{y,y'}_0 \cap \mathcal{A}^{y',y}_0\big) = 0$ for all $y \neq y'$. Let $y, y' \in [n]$ with $y \neq y'$, then

$$\mathcal{A}^{y,y'}_0 \cap \mathcal{A}^{y',y}_0 = \big\{p \in \Delta_n : p^\top(g_y - g_{y'}) = 0\big\}.$$

If $g_y - g_{y'} = d\,e$ for some $d \neq 0$, the above set is clearly empty. If $g_y - g_{y'}$ is not a scalar multiple of $e$, then $p^\top(g_y - g_{y'})$ is a non-constant linear function of $p$ over $\Delta_n$, and hence by Lemma 17 we have that $\mu\big(\mathcal{A}^{y,y'}_0 \cap \mathcal{A}^{y',y}_0\big) = 0$. By the absolute continuity of $\nu$ w.r.t. $\mu$ we have that

$$\nu\big(\mathcal{A}^{y,y'}_0 \cap \mathcal{A}^{y',y}_0\big) = 0.$$

As the above arguments hold for any $y, y' \in [n]$ with $y \neq y'$, the proof of part (c) is complete. □

Lemma 19 (Uniqueness of Optimal Confusion Matrix for Special Gain Matrices). Let $D$ be a distribution over $\mathcal{X} \times \mathcal{Y}$. Let $\nu$ be absolutely continuous w.r.t. $\mu$, and let $G \in \mathbb{R}^{n\times n}$ be such that no two columns are identical. Then,

$$\operatorname{argmax}_{C \in \overline{\mathcal{C}_D}} \langle G, C\rangle = \operatorname{argmax}_{C \in \mathcal{C}_D} \langle G, C\rangle.$$

Moreover, the above set is a singleton.

Proof. We shall proceed by showing that the maximizer of $\langle G, C\rangle$ over $\mathcal{C}_D$ is unique and then show that there exists no other maximizer of $\langle G, C\rangle$ over $\overline{\mathcal{C}_D}$.

Using Proposition 15, we will only consider classifiers $h : \mathcal{X} \to \Delta_n$ that can be decomposed as $h = f \circ \eta$ for some $f : \Delta_n \to \Delta_n$.

From Equation O we have that any $f^* \in \operatorname{argmax}_{f : \Delta_n \to \Delta_n} \langle G, \operatorname{conf}(f \circ \eta, D)\rangle$ is such that the following holds $\nu$-almost everywhere:

$$f^*_i(p) > 0 \ \text{ only if } \ i \in \operatorname{argmax}_{y \in [n]} g_y^\top p.$$

We will show that the maximizer of $\langle G, C\rangle$ over $\mathcal{C}_D$ is unique, simply by showing that any $f^*$ satisfying the above condition has the same $\operatorname{conf}(f^* \circ \eta, D)$, which we in turn show by proving that any two functions $f^*$ satisfying the above condition are the same $\nu$-almost everywhere.

For a given $p \in \Delta_n$, if $T_p = \operatorname{argmax}_{y \in [n]} g_y^\top p$ is a singleton, then $f^*(p)$ is uniquely defined due to the sum-to-one constraint. If $p$ is such that $|T_p| > 1$, then $(p^\top G)_{(1)} - (p^\top G)_{(2)} = 0$. From Lemma 18, the $\nu$-measure of all such $p$ vectors is exactly equal to $r(0) = 0$.

This completes the proof of the uniqueness of the maximizer of $\langle G, C\rangle$ over $\mathcal{C}_D$. Let us denote it by $C^*$. Also let $f^* : \Delta_n \to \Delta_n$ with $C^* = \operatorname{conf}(f^* \circ \eta, D)$ refer to the following fixed function:

$$f^*_i(p) = \begin{cases} 1 & \text{if } i = \operatorname{argmax}_{y \in [n]} p^\top g_y \\ 0 & \text{otherwise.} \end{cases}$$

Let $C' \in \operatorname{argmax}_{C \in \overline{\mathcal{C}_D}} \langle G, C\rangle$. Let us assume $C^* \neq C'$, so that

$$\|C^* - C'\|_1 = \sum_{i=1}^{n} \sum_{j=1}^{n} |C^*_{i,j} - C'_{i,j}| = \gamma > 0.$$
i= 1 j=1 
We shall go on to derive a contradiction as follows. By virtue of $C' \in \overline{\mathcal{C}_D}$ there exists a sequence of classifiers whose confusion matrices approach $C'$. Hence these classifiers are all 'close' to maximal for the gain matrix $G$. We then show that these classifiers perform strictly worse than $h^*$ by exploiting the fact that the confusion matrices of these classifiers are bounded away from $C^*$. This provides us the required contradiction.

As $C' \in \overline{\mathcal{C}_D}$, we have that for all $\epsilon > 0$, there exists $C_\epsilon \in \mathcal{C}_D$ such that $\|C_\epsilon - C'\|_1 \le \epsilon$. This implies that

$$\|C_\epsilon - C^*\|_1 \ge \gamma - \epsilon, \tag{5}$$

$$\langle G, C_\epsilon\rangle \ge \langle G, C'\rangle - \|G\|_\infty\,\epsilon \ge \langle G, C^*\rangle - \|G\|_\infty\,\epsilon. \tag{6}$$

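Let $f_\epsilon : \Delta_n \to \Delta_n$ be s.t. $C_\epsilon = \operatorname{conf}(f_\epsilon \circ \eta, D)$. Let $\mathcal{B} = \{p \in \Delta_n : \|f^*(p) - f_\epsilon(p)\|_1 \ge \tfrac{\gamma}{4}\}$. Applying Equation 5 and Lemma 16 we have

$$
\begin{aligned}
\gamma - \epsilon
&\le \|\operatorname{conf}(f^* \circ \eta, D) - \operatorname{conf}(f_\epsilon \circ \eta, D)\|_1 \\
&\le \int_{p \in \Delta_n} \|f^*(p) - f_\epsilon(p)\|_1 \, d\nu(p) \\
&\le \int_{p \in \mathcal{B}} 2 \, d\nu(p) + \int_{p \notin \mathcal{B}} \frac{\gamma}{4} \, d\nu(p) \\
&= 2\nu(\mathcal{B}) + \frac{\gamma}{4}\big(1 - \nu(\mathcal{B})\big) \\
&\le 2\nu(\mathcal{B}) + \frac{\gamma}{4}.
\end{aligned}
\tag{7}
$$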


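For any $c > 0$, define $\mathcal{A}_c \subseteq \Delta_n$ as

$$\mathcal{A}_c = \big\{p \in \Delta_n : (p^\top G)_{(1)} - (p^\top G)_{(2)} \le c\big\}.$$

From Lemma 18 we have that $\nu(\mathcal{A}_c)$ is a continuous function of $c$ close to $0$ and $\nu(\mathcal{A}_0) = 0$. Let $c > 0$ be such that

$$\nu(\mathcal{A}_c) \le \frac{\gamma}{16}. \tag{8}$$

From Equations 7 and 8 we have $\nu(\mathcal{B} \setminus \mathcal{A}_c) \ge \frac{5\gamma}{16} - \frac{\epsilon}{2}$. Any $p \in \mathcal{B} \setminus \mathcal{A}_c$ is such that

$$(p^\top G)_{(1)} - (p^\top G)_{(2)} > c \quad \text{and} \quad \|f^*(p) - f_\epsilon(p)\|_1 \ge \frac{\gamma}{4}.$$

For any $p \in \Delta_n$, we have that $f^*(p)$ has a 1 corresponding to the maximum value of $p^\top G$ and zero elsewhere. For any $p \in \mathcal{B} \setminus \mathcal{A}_c$, we have $\|f^*(p) - f_\epsilon(p)\|_1 \ge \frac{\gamma}{4}$, and hence the value of $f_\epsilon(p)$ corresponding to the index of the maximum value of $p^\top G$ is at most $(1 - \frac{\gamma}{8})$. In particular, we have

$$p^\top G f_\epsilon(p) \le \Big(1 - \frac{\gamma}{8}\Big)(p^\top G)_{(1)} + \Big(\frac{\gamma}{8}\Big)(p^\top G)_{(2)}. \tag{9}$$

Thus we have,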

Thus we have, 

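$$
\begin{aligned}
\langle G, C^*\rangle - \langle G, C_\epsilon\rangle
&= \int_{p \in \Delta_n} p^\top G\big(f^*(p) - f_\epsilon(p)\big) \, d\nu(p) \\
&= \int_{p \in \mathcal{B} \setminus \mathcal{A}_c} p^\top G\big(f^*(p) - f_\epsilon(p)\big) \, d\nu(p) + \int_{p \in \Delta_n \setminus (\mathcal{B} \setminus \mathcal{A}_c)} p^\top G\big(f^*(p) - f_\epsilon(p)\big) \, d\nu(p) \\
&\ge \int_{p \in \mathcal{B} \setminus \mathcal{A}_c} p^\top G\big(f^*(p) - f_\epsilon(p)\big) \, d\nu(p) \\
&= \int_{p \in \mathcal{B} \setminus \mathcal{A}_c} \big((p^\top G)_{(1)} - p^\top G f_\epsilon(p)\big) \, d\nu(p) \\
&\ge \int_{p \in \mathcal{B} \setminus \mathcal{A}_c} \Big((p^\top G)_{(1)} - \Big(1 - \frac{\gamma}{8}\Big)(p^\top G)_{(1)} - \Big(\frac{\gamma}{8}\Big)(p^\top G)_{(2)}\Big) \, d\nu(p) \\
&= \int_{p \in \mathcal{B} \setminus \mathcal{A}_c} \frac{\gamma}{8}\big((p^\top G)_{(1)} - (p^\top G)_{(2)}\big) \, d\nu(p) \\
&\ge \int_{p \in \mathcal{B} \setminus \mathcal{A}_c} \frac{\gamma c}{8} \, d\nu(p) \\
&\ge \frac{\gamma c}{8}\Big(\frac{5\gamma}{16} - \frac{\epsilon}{2}\Big).
\end{aligned}
$$

If $\epsilon \le \frac{\gamma}{2}$, we have

$$\langle G, C^*\rangle - \langle G, C_\epsilon\rangle \ge \frac{\gamma^2 c}{128}.$$

The above holds for any $\epsilon \in (0, \frac{\gamma}{2}]$, and both $\gamma$ and $c$ do not depend on $\epsilon$. For small enough $\epsilon$, this contradicts Equation 6. We thus have a contradiction to our assumption $C^* \neq C'$. □

The proof of Lemma 3 simply follows from Lemma 19 by observing that if $\psi$ satisfies Assumption B, then no two columns of its gradient at any point are identical.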


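A.3 Proof of Lemma 5

Lemma (Convergence of conf for fixed gain matrix). Let $D$ satisfy Assumption A. Let $\hat\eta_S : \mathcal{X} \to \Delta_n$ be a class probability estimation model learned using a sample $S$ drawn i.i.d. from $D^m$. For a fixed gain matrix $G \in [0,1]^{n\times n}$ such that no two columns are identical, let $h_G : \mathcal{X} \to \Delta_n$ and $\hat h_G : \mathcal{X} \to \Delta_n$ be classifiers constructed as follows: $[h_G(x)]_i = 1$ if $i = \operatorname{argmax}_{y\in[n]} g_y^\top \eta(x)$, $\forall x \in \mathcal{X}$, and $[\hat h_G(x)]_i = 1$ if $i = \operatorname{argmax}_{y\in[n]} g_y^\top \hat\eta_S(x)$, $\forall x \in \mathcal{X}$. If $\hat\eta_S$ is such that $\mathbf{E}_X\big[\|\hat\eta_S(X) - \eta(X)\|_1\big] \xrightarrow{P} 0$, then $\forall i,j$, $[\operatorname{conf}(\hat h_G, D)]_{i,j} \xrightarrow{P} [\operatorname{conf}(h_G, D)]_{i,j}$ (as $m \to \infty$).

Proof. Fix $\epsilon, \delta, \delta' > 0$. By virtue of $\mathbf{E}_X\big[\|\hat\eta_S(X) - \eta(X)\|_1\big] \xrightarrow{P} 0$, there exists an $M_{\epsilon,\delta}$ such that for all $m \ge M_{\epsilon,\delta}$ we have with probability at least $1 - \delta$ over the draw of $S$ that

$$\mathbf{E}_X\big[\|\hat\eta_S(X) - \eta(X)\|_1\big] \le \epsilon.$$

Let $m \ge M_{\epsilon,\delta}$. By Markov's inequality we have that

$$\mathbf{P}_X\Big(\|\hat\eta_S(X) - \eta(X)\|_1 \ge \frac{\mathbf{E}_X\big[\|\hat\eta_S(X) - \eta(X)\|_1\big]}{\delta'}\Big) \le \delta'.$$

Hence with probability at least $1 - \delta - \delta'$ over the draw of both $S$ and $X$, we have

$$\|\hat\eta_S(X) - \eta(X)\|_1 \le \frac{\epsilon}{\delta'}. \tag{10}$$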
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Based on the above inequality we will argue that he and he have the same value for most 
instances. 

For any x G X. Let y*(x) = argmax yG [ n ]gJ rj(x) and y*(x) = argmax yg [ n jg J fjg(x). The 
following implications hold: 

h G (A)/h G (X) =► y*(X)j£y*(X) 

=> g y*( X )V( x ) > gf (x)V(X) and g< g^ (X) %(X) 

Using equation [TO] the following holds with probability at least 1 — 5 — 5' over X and S: 

h G (X) + h G (X) => E^ {x) v(X) < ^ {X) V(X) < E^ {X) V(X) + 2j, 

=* (Sy*(X)~gy*(X)) T V(X) G 

=> 3y,y' G [n],y fi y' s.t. (g y 

For any y, y' G [n] with y / y' define the set A yy i C A n as 

A hy i = {p G A n : (g y - g y /) T p € [0, 2e/5']} 

We thus have that h G (X) = h G (A) with probability at least 1 — 5 — 5' — J2 y y 'e[n\ y^y 1 ) - 

As G has no two identical columns we have that (g y — g y /) T p is never zero for all p G A„. Let 
y, y' € [n] with y A y'. If g y — g y r = de for some and d > 0, we have that A y y is empty for small 
enough 4. Otherwise, we have by Lemma [171 that /i(A y y) approaches 0 as ^ approaches 0. And 
by the absolute continuity of v w.r.t. y , we have that u(A y y/) also approaches 0 as jr approaches 

0 . 

Thus by having e, 5, 5' and ^ simultaneously approach zero, we have that the probability of the 
statement h G (X) = h G (X) approaches 1. And hence Vi, j, [conf(h G , D )]. . = |conf(h G , D)] ... 

L J IJ L J IJ 

□ 


2e 

J 


gy') T V(X) G 


„ 2 e 

0, y 
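The Markov step used in the proof above can be checked numerically. The following is a minimal sketch, not part of the original analysis: the exponential samples are an assumed stand-in for the non-negative error $\|\hat{\eta}_S(X) - \eta(X)\|_1$, and the names `errors` and `delta_prime` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed stand-in for the non-negative error ||eta_hat_S(X) - eta(X)||_1.
errors = rng.exponential(scale=0.05, size=10_000)
delta_prime = 0.1

# Markov's inequality applied to the empirical distribution of the samples:
# the fraction of errors at or above mean / delta_prime is at most delta_prime.
threshold = errors.mean() / delta_prime
tail_fraction = np.mean(errors >= threshold)

print(f"threshold = {threshold:.4f}, tail fraction = {tail_fraction:.4f}")
```

Because the inequality is applied to the empirical distribution itself, the check holds for any non-negative sample, not only for the exponential one assumed here.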


A.4 Proof of Lemma [6] 

Lemma (Uniform Convergence Generalization Bound for conf Over $\mathcal{H}_\mu$). Let $\mu : \mathcal{X} \to \mathbb{R}^n$ be a fixed function and $S \in (\mathcal{X} \times [n])^m$ be a sample drawn i.i.d. according to $D^m$. For any $\delta \in (0, 1]$, we have with probability at least $1 - \delta$ (over draw of $S$ from $D^m$),

\[ \sup_{h \in \mathcal{H}_\mu} \big\| \mathrm{conf}(h, D) - \mathrm{conf}(h, D_S) \big\|_\infty \le C \sqrt{\frac{n^2 \log(n) \log(m) + \log(n^2/\delta)}{m}} , \]

where $C > 0$ is a distribution-independent constant.

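Proof. First observe that every function $h \in \mathcal{H}_\mu$ is such that for all $x \in \mathcal{X}$, the vector $h(x)$ is always one of the co-ordinate vectors in $\mathbb{R}^n$. For any $a, b \in [n]$ we have

\[ \sup_{h \in \mathcal{H}_\mu} \big| [\mathrm{conf}(h, D_S)]_{a,b} - [\mathrm{conf}(h, D)]_{a,b} \big| = \sup_{h \in \mathcal{H}_\mu} \Big| \frac{1}{m} \sum_{i=1}^m \Big( \mathbf{1}\big(y_i = a, h_b(x_i) = 1\big) - \mathbf{E}\big[\mathbf{1}(Y = a, h_b(X) = 1)\big] \Big) \Big| \]
\[ = \sup_{h \in \mathcal{H}^b_\mu} \Big| \frac{1}{m} \sum_{i=1}^m \Big( \mathbf{1}\big(y_i = a, h(x_i) = 1\big) - \mathbf{E}\big[\mathbf{1}(Y = a, h(X) = 1)\big] \Big) \Big| , \]

where $\mathcal{H}^b_\mu = \{ h : \mathcal{X} \to \{0,1\} : \exists G \in \mathbb{R}^{n \times n}, \forall x \in \mathcal{X}, h(x) = \mathbf{1}(b = \operatorname{argmax}_{t \in [n]} g_t^\top \mu(x)) \}$. The set $\mathcal{H}^b_\mu$ can be seen as a hypothesis class whose concepts are the intersection of $n$ halfspaces in $\mathbb{R}^n$ (corresponding to $\mu(x)$) through the origin. Hence we have from Lemma 3.2.3 of Blumer et al. (1989) [40] that the VC-dimension of $\mathcal{H}^b_\mu$ is at most $2n^2 \log(3n)$. From standard uniform convergence arguments we have that the following holds with probability at least $1 - \delta$:

\[ \sup_{h \in \mathcal{H}_\mu} \big| [\mathrm{conf}(h, D_S)]_{a,b} - [\mathrm{conf}(h, D)]_{a,b} \big| \le C \sqrt{\frac{n^2 \log(n) \log(m) + \log(1/\delta)}{m}} , \]

where $C > 0$ is some constant. Applying the union bound over all $a, b \in [n]$, we have that the following holds with probability at least $1 - \delta$:

\[ \sup_{h \in \mathcal{H}_\mu} \big\| \mathrm{conf}(h, D_S) - \mathrm{conf}(h, D) \big\|_\infty \le C \sqrt{\frac{n^2 \log(n) \log(m) + \log(n^2/\delta)}{m}} . \]

□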
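To get a feel for the sample sizes at which the uniform convergence bound above becomes informative, its right-hand side can be evaluated directly. This is only an illustrative sketch: the constant $C$ is not specified in the lemma, so `C=1.0` below is an assumption, and the function name `conf_deviation_bound` is hypothetical.

```python
import math

def conf_deviation_bound(n, m, delta, C=1.0):
    """Right-hand side of the bound; C is unspecified in the lemma, so C=1.0 is assumed."""
    numerator = n**2 * math.log(n) * math.log(m) + math.log(n**2 / delta)
    return C * math.sqrt(numerator / m)

# The bound shrinks roughly like sqrt(log(m) / m) for a fixed number of classes n.
for m in (10**4, 10**5, 10**6):
    print(m, round(conf_deviation_bound(n=10, m=m, delta=0.05), 4))
```

Under this assumed constant, the bound for $n = 10$ classes only drops well below 1 once $m$ is in the tens of thousands, which reflects the $n^2 \log(n)$ dependence coming from the VC-dimension estimate.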


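A.5 Proof of Lemma [8]

Lemma (Regret Bound for Linear/Decomposable Performance Metric with Bounded Gain Matrix). Let $G \in [-L, L]^{n \times n}$ be a fixed gain matrix. Let $\hat{\eta} : \mathcal{X} \to \Delta_n$ be a class probability estimation model and $\hat{h}_G : \mathcal{X} \to \Delta_n$ be a classifier constructed such that $[\hat{h}_G(x)]_i = 1$ if $i = \operatorname{argmax}_{y \in [n]} g_y^\top \hat{\eta}(x)$. We then have

\[ \max_{h : \mathcal{X} \to \Delta_n} \langle G, \mathrm{conf}(h, D) \rangle - \langle G, \mathrm{conf}(\hat{h}_G, D) \rangle \le 2L\, \mathbf{E}_X\big[\|\hat{\eta}(X) - \eta(X)\|_1\big] . \]

Proof. Let $h^* : \mathcal{X} \to \Delta_n$ be such that

\[ [h^*(x)]_i = 1 \ \text{ if } \ i \in \operatorname{argmax}_{y \in [n]} g_y^\top \eta(x) . \]

Hence by Equation [2] we have that

\[ h^* \in \operatorname{argmax}_{h : \mathcal{X} \to \Delta_n} \langle G, \mathrm{conf}(h, D) \rangle . \]

We have that

\[ \max_{h : \mathcal{X} \to \Delta_n} \langle G, \mathrm{conf}(h, D) \rangle - \langle G, \mathrm{conf}(\hat{h}_G, D) \rangle \]
\[ = \langle G, \mathrm{conf}(h^*, D) \rangle - \langle G, \mathrm{conf}(\hat{h}_G, D) \rangle \]
\[ = \mathbf{E}_X\big[\eta(X)^\top G\, h^*(X)\big] - \mathbf{E}_X\big[\eta(X)^\top G\, \hat{h}_G(X)\big] \]
\[ = \mathbf{E}_X\big[\eta(X)^\top G\, h^*(X)\big] - \mathbf{E}_X\big[(\eta(X) - \hat{\eta}(X))^\top G\, \hat{h}_G(X)\big] - \mathbf{E}_X\big[\hat{\eta}(X)^\top G\, \hat{h}_G(X)\big] \]
\[ \le \mathbf{E}_X\big[\eta(X)^\top G\, h^*(X)\big] - \mathbf{E}_X\big[(\eta(X) - \hat{\eta}(X))^\top G\, \hat{h}_G(X)\big] - \mathbf{E}_X\big[\hat{\eta}(X)^\top G\, h^*(X)\big] \]
\[ = \mathbf{E}_X\big[(\eta(X) - \hat{\eta}(X))^\top G\, (h^*(X) - \hat{h}_G(X))\big] \]
\[ \le 2L\, \mathbf{E}_X\big[\|\eta(X) - \hat{\eta}(X)\|_1\big] . \]

□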
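The regret bound for the linear case can be sanity-checked on a small synthetic problem. The sketch below is illustrative only and makes several assumptions that are not in the original: a finite instance space with a uniform marginal, Dirichlet-distributed class probabilities, and a noisy estimate `eta_hat`; all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_points, L = 4, 200, 1.0

# Assumed finite instance space with a uniform marginal over num_points instances.
eta = rng.dirichlet(np.ones(n), size=num_points)       # true class probabilities eta(x)
eta_hat = np.abs(eta + rng.normal(scale=0.1, size=eta.shape))
eta_hat /= eta_hat.sum(axis=1, keepdims=True)          # noisy estimate, renormalized to the simplex
G = rng.uniform(-L, L, size=(n, n))                    # gain matrix with entries in [-L, L]

def linear_gain(G, eta, preds):
    """<G, conf(h, D)> = E_X[ g_{h(X)}^T eta(X) ] for a deterministic classifier h."""
    scores = eta @ G                                   # scores[x, j] = g_j^T eta(x), g_j = column j of G
    return scores[np.arange(len(preds)), preds].mean()

best = linear_gain(G, eta, (eta @ G).argmax(axis=1))          # optimal classifier h*
plug_in = linear_gain(G, eta, (eta_hat @ G).argmax(axis=1))   # plug-in classifier built from eta_hat
regret = best - plug_in
bound = 2 * L * np.abs(eta_hat - eta).sum(axis=1).mean()

print(f"regret = {regret:.4f}, bound = {bound:.4f}")
```

The bound is typically loose on such instances, since it charges the full $\ell_1$ estimation error even at points where the plug-in classifier agrees with the optimal one.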
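A.6 Proof of Theorem [7]

Theorem (Regret Bound for Brute-force Plug-in Algorithm for Convex-like Non-decomposable Metrics). Let $D$ satisfy Assumption A, and $\psi : [0,1]^{n \times n} \to \mathbb{R}_+$ satisfy Assumption B w.r.t. $D$. Furthermore, let $\psi$ be $L$-Lipschitz w.r.t. the $\ell_1$ norm over $\mathcal{C}_D$, and be such that there exists $\xi > 0$ such that $\psi(C) - \psi(C') \le \xi \langle \nabla\psi(C), C - C' \rangle$, $\forall C, C' \in \mathcal{C}_D$. If $\hat{h}_S$ is the classifier learned by the brute-force plug-in algorithm using training sample $S = (S', S'') \in (\mathcal{X} \times [n])^m$ with parameter $\alpha \in (0,1)$, then for any $\delta \in (0,1]$, we have with probability at least $1 - \delta$ (over draw of $S$ from $D^m$):

\[ \mathrm{regret}^\psi_D[\hat{h}_S] \le 2L\xi\, \mathbf{E}_X\big[\|\hat{\eta}_{S'}(X) - \eta(X)\|_1\big] + 2LC \sqrt{\frac{n^2 \log(n) \log(\alpha m) + \log(n^2/\delta)}{\alpha m}} , \]

where $C > 0$ is a distribution-independent constant.

Proof. By Theorem [2] a $\psi$-optimal classifier exists. Let $h^* : \mathcal{X} \to \Delta_n$ be one such classifier and let $G^* = \nabla\psi(\mathrm{conf}(h^*, D))$. Further, let $h_{G^*} : \mathcal{X} \to \Delta_n$ be a classifier such that $[h_{G^*}(x)]_i = 1$ if $i = \operatorname{argmax}_{y \in [n]} g^{*\top}_y \eta(x)$; then again by Theorem [2], $\mathcal{P}_D[h_{G^*}] = \mathcal{P}_D[h^*]$. Also let $\hat{h}_{G^*} : \mathcal{X} \to \Delta_n$ be such that $[\hat{h}_{G^*}(x)]_i = 1$ if $i = \operatorname{argmax}_{y \in [n]} g^{*\top}_y \hat{\eta}_{S'}(x)$. Thus,

\[ \mathrm{regret}^\psi_D[\hat{h}_S] = \mathcal{P}_D[h^*] - \mathcal{P}_D[\hat{h}_S] \]
\[ = \mathcal{P}_D[h_{G^*}] - \mathcal{P}_D[\hat{h}_S] \]
\[ = \mathcal{P}_D[h_{G^*}] - \mathcal{P}_D[\hat{h}_{G^*}] + \mathcal{P}_D[\hat{h}_{G^*}] - \mathcal{P}_{D_{S''}}[\hat{h}_{G^*}] + \mathcal{P}_{D_{S''}}[\hat{h}_{G^*}] - \mathcal{P}_D[\hat{h}_S] \]
\[ \le \mathcal{P}_D[h_{G^*}] - \mathcal{P}_D[\hat{h}_{G^*}] + \mathcal{P}_D[\hat{h}_{G^*}] - \mathcal{P}_{D_{S''}}[\hat{h}_{G^*}] + \mathcal{P}_{D_{S''}}[\hat{h}_S] - \mathcal{P}_D[\hat{h}_S] \]
\[ \le \mathcal{P}_D[h_{G^*}] - \mathcal{P}_D[\hat{h}_{G^*}] + \sup_{h \in \mathcal{H}_{\hat{\eta}_{S'}}} \big( \mathcal{P}_D[h] - \mathcal{P}_{D_{S''}}[h] \big) + \sup_{h \in \mathcal{H}_{\hat{\eta}_{S'}}} \big( \mathcal{P}_{D_{S''}}[h] - \mathcal{P}_D[h] \big) \]
\[ \le \mathcal{P}_D[h_{G^*}] - \mathcal{P}_D[\hat{h}_{G^*}] + 2 \sup_{h \in \mathcal{H}_{\hat{\eta}_{S'}}} \big| \mathcal{P}_D[h] - \mathcal{P}_{D_{S''}}[h] \big| \]
\[ = \psi(\mathrm{conf}(h_{G^*}, D)) - \psi(\mathrm{conf}(\hat{h}_{G^*}, D)) + 2 \sup_{h \in \mathcal{H}_{\hat{\eta}_{S'}}} \big| \psi(\mathrm{conf}(h, D)) - \psi(\mathrm{conf}(h, D_{S''})) \big| \]
\[ \le \xi \big( \langle G^*, \mathrm{conf}(h_{G^*}, D) \rangle - \langle G^*, \mathrm{conf}(\hat{h}_{G^*}, D) \rangle \big) + 2L \sup_{h \in \mathcal{H}_{\hat{\eta}_{S'}}} \big\| \mathrm{conf}(h, D) - \mathrm{conf}(h, D_{S''}) \big\|_1 \]
\[ \le 2L\xi\, \mathbf{E}_X\big[\|\hat{\eta}_{S'}(X) - \eta(X)\|_1\big] + 2LC \sqrt{\frac{n^2 \log(n) \log(\alpha m) + \log(n^2/\delta)}{\alpha m}} , \]

where the fourth step follows by definition of $\hat{h}_S$, the previous to last step follows from the 'convexity-like' assumption on $\psi$ together with its Lipschitz property, and the last step follows from the two preceding lemmas (the linear regret bound of Section A.5 and the uniform convergence bound of Section A.4). □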


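A.7 Proof of Theorem [9]

Theorem (Regret Bound for (Idealized) Conditional Gradient Algorithm for Concave Smooth Non-decomposable Metrics). Let $\psi : [0,1]^{n \times n} \to \mathbb{R}_+$ be concave over $\mathcal{C}_D$ and $\beta$-smooth w.r.t. the $\ell_1$-norm over $\mathcal{C}_D$. Let $h^{\mathrm{FW}}$ be the classifier learned by the (idealized) conditional gradient algorithm with parameters $\kappa \in \mathbb{N}$ and $\epsilon > 0$. Then

\[ \mathrm{regret}^\psi_D[h^{\mathrm{FW}}] \le 2\epsilon + \frac{8\beta}{\kappa m + 2} . \]

Proof. We use the result from [7]. To apply this result we must upper bound the 'curvature constant' $C_\psi$ of $\psi$ and the approximation factor $\delta$ (which we call $\delta_{\mathrm{apx}}$).

\[ C_\psi = \sup_{C_1, C_2 \in \mathcal{C}_D,\ \gamma \in [0,1]} \frac{2}{\gamma^2} \Big( \psi(C_1) + \gamma \langle C_2 - C_1, \nabla\psi(C_1) \rangle - \psi\big(C_1 + \gamma(C_2 - C_1)\big) \Big) \]
\[ \le \sup_{C_1, C_2 \in \mathcal{C}_D,\ \gamma \in [0,1]} \frac{2}{\gamma^2} \Big( \frac{\beta}{2} \gamma^2 \|C_1 - C_2\|_1^2 \Big) \]
\[ \le 4\beta , \]

where the second step follows from the $\beta$-smoothness of $\psi$ over $\mathcal{C}_D$ w.r.t. the $\ell_1$ norm, and the last step follows from the observation that the entries of $C_1$ and $C_2$ sum to 1 and are non-negative, so that $\|C_1 - C_2\|_1 \le 2$. One can also see that the approximation factor satisfies $\delta_{\mathrm{apx}} \le (T+1)\epsilon / C_\psi$, where $T = \kappa m$ is the number of iterations. Theorem 1 from [7] gives us

\[ \mathrm{regret}^\psi_D[h^{\mathrm{FW}}] = \max_{C \in \mathcal{C}_D} \psi(C) - \psi(\mathrm{conf}(h^{\mathrm{FW}}, D)) \]
\[ \le \frac{2 C_\psi}{T + 2} (1 + \delta_{\mathrm{apx}}) \]
\[ \le \frac{2 C_\psi}{T + 2} + \frac{2(T+1)\epsilon}{T + 2} \]
\[ \le \frac{8\beta}{T + 2} + 2\epsilon \]
\[ = \frac{8\beta}{\kappa m + 2} + 2\epsilon . \]

□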
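The $8\beta/(T+2)$ rate can be illustrated with a small conditional gradient run. The sketch below is not the algorithm analyzed above: it assumes an exact linear maximization step ($\epsilon = 0$), uses the full probability simplex over the $n^2$ matrix entries as a simplified stand-in for $\mathcal{C}_D$, and takes a hypothetical concave quadratic objective that is 2-smooth w.r.t. the $\ell_1$ norm.

```python
import numpy as np

def conditional_gradient(grad, dim, num_iters):
    """Frank-Wolfe ascent over the probability simplex with an exact linear step."""
    c = np.full(dim, 1.0 / dim)                  # start at the uniform point
    for t in range(num_iters):
        s = np.zeros(dim)
        s[np.argmax(grad(c))] = 1.0              # vertex maximizing <grad, s> over the simplex
        gamma = 2.0 / (t + 2)                    # standard step size
        c = (1 - gamma) * c + gamma * s
    return c

rng = np.random.default_rng(0)
n = 3
target = rng.dirichlet(np.ones(n * n))           # hypothetical maximizer inside the simplex
psi = lambda c: -np.sum((c - target) ** 2)       # concave; gradient is 2-Lipschitz from l1 to l_inf
grad = lambda c: -2.0 * (c - target)

beta, T = 2.0, 200
c_T = conditional_gradient(grad, n * n, T)
gap = psi(target) - psi(c_T)                     # optimality gap after T iterations
bound = 8 * beta / (T + 2)

print(f"gap = {gap:.5f}, bound = {bound:.5f}")
```

In the sample-based setting the linear step is only approximate, which is where the additional $2\epsilon$ term in the theorem comes from.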


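A.8 Proof of Lemma [11]

Lemma (Approximation Factor for Linear Maximization Step in Algorithm [4]). Let $\psi : [0,1]^{n \times n} \to \mathbb{R}_+$ satisfy the assumptions in Theorem [10]. Let $\hat{u}^j$ and $\hat{h}^j$ be the classifiers constructed in any given iteration $j$ of Algorithm [7] using training sample $S = (S', S'') \in (\mathcal{X} \times [n])^m$ and parameter $\alpha \in (0,1)$. Also, let $G^j = \nabla\psi(\mathrm{conf}(\hat{h}^{j-1}, D))$. Then for any $\delta \in (0,1]$, we have with probability at least $1 - \delta$ (over draw of $S$ from $D^m$) for all $1 \le j \le T$

\[ \langle G^j, \mathrm{conf}(\hat{u}^j, D) \rangle \ge \max_{u : \mathcal{X} \to \Delta_n} \langle G^j, \mathrm{conf}(u, D) \rangle - \epsilon_S , \]

where

\[ \epsilon_S = 2L\, \mathbf{E}_X\big[\|\hat{\eta}_{S'}(X) - \eta(X)\|_1\big] + 2\beta n^2 C \sqrt{\frac{n^2 \log(n) \log(\alpha m) + \log(n^2/\delta)}{\alpha m}} , \]

for a distribution-independent constant $C > 0$.

Proof. Let $1 \le j \le T$. Let $\hat{G}^j = \nabla\psi(\mathrm{conf}(\hat{h}^{j-1}, D_{S''}))$. Also let $u^* \in \operatorname{argmax}_{u : \mathcal{X} \to \Delta_n} \langle G^j, \mathrm{conf}(u, D) \rangle$. We then have by the definition of $\hat{u}^j$ and Lemma [8] that

\[ \langle \hat{G}^j, \mathrm{conf}(u^*, D) \rangle - \langle \hat{G}^j, \mathrm{conf}(\hat{u}^j, D) \rangle \le \max_{u : \mathcal{X} \to \Delta_n} \langle \hat{G}^j, \mathrm{conf}(u, D) \rangle - \langle \hat{G}^j, \mathrm{conf}(\hat{u}^j, D) \rangle \]
\[ \le 2L\, \mathbf{E}_X\big[\|\hat{\eta}_{S'}(X) - \eta(X)\|_1\big] . \tag{11} \]

Also

\[ \|\hat{G}^j - G^j\|_\infty = \big\| \nabla\psi(\mathrm{conf}(\hat{h}^{j-1}, D_{S''})) - \nabla\psi(\mathrm{conf}(\hat{h}^{j-1}, D)) \big\|_\infty \]
\[ \le \beta \big\| \mathrm{conf}(\hat{h}^{j-1}, D_{S''}) - \mathrm{conf}(\hat{h}^{j-1}, D) \big\|_1 \]
\[ \le \beta n^2 \big\| \mathrm{conf}(\hat{h}^{j-1}, D_{S''}) - \mathrm{conf}(\hat{h}^{j-1}, D) \big\|_\infty \]
\[ \le \beta n^2 \max_{k} \big\| \mathrm{conf}(\hat{u}^k, D_{S''}) - \mathrm{conf}(\hat{u}^k, D) \big\|_\infty \]
\[ \le \beta n^2 \sup_{h \in \mathcal{H}_{\hat{\eta}_{S'}}} \big\| \mathrm{conf}(h, D_{S''}) - \mathrm{conf}(h, D) \big\|_\infty . \tag{12} \]

We then have

\[ \max_{u : \mathcal{X} \to \Delta_n} \langle G^j, \mathrm{conf}(u, D) \rangle - \langle G^j, \mathrm{conf}(\hat{u}^j, D) \rangle \]
\[ = \langle G^j, \mathrm{conf}(u^*, D) \rangle - \langle G^j, \mathrm{conf}(\hat{u}^j, D) \rangle \]
\[ = \langle G^j, \mathrm{conf}(u^*, D) \rangle - \langle \hat{G}^j, \mathrm{conf}(u^*, D) \rangle + \langle \hat{G}^j, \mathrm{conf}(u^*, D) \rangle - \langle G^j, \mathrm{conf}(\hat{u}^j, D) \rangle \]
\[ \le \|G^j - \hat{G}^j\|_\infty \|\mathrm{conf}(u^*, D)\|_1 + \langle \hat{G}^j, \mathrm{conf}(u^*, D) \rangle - \langle G^j, \mathrm{conf}(\hat{u}^j, D) \rangle \]
\[ = \|G^j - \hat{G}^j\|_\infty + \langle \hat{G}^j, \mathrm{conf}(u^*, D) \rangle - \langle G^j, \mathrm{conf}(\hat{u}^j, D) \rangle \]
\[ = \|G^j - \hat{G}^j\|_\infty + \langle \hat{G}^j, \mathrm{conf}(u^*, D) \rangle - \langle \hat{G}^j, \mathrm{conf}(\hat{u}^j, D) \rangle + \langle \hat{G}^j, \mathrm{conf}(\hat{u}^j, D) \rangle - \langle G^j, \mathrm{conf}(\hat{u}^j, D) \rangle \]
\[ \le \|G^j - \hat{G}^j\|_\infty + 2L\, \mathbf{E}_X\big[\|\hat{\eta}_{S'}(X) - \eta(X)\|_1\big] + \langle \hat{G}^j, \mathrm{conf}(\hat{u}^j, D) \rangle - \langle G^j, \mathrm{conf}(\hat{u}^j, D) \rangle \]
\[ \le \|G^j - \hat{G}^j\|_\infty + 2L\, \mathbf{E}_X\big[\|\hat{\eta}_{S'}(X) - \eta(X)\|_1\big] + \|\hat{G}^j - G^j\|_\infty \|\mathrm{conf}(\hat{u}^j, D)\|_1 \]
\[ = 2\|G^j - \hat{G}^j\|_\infty + 2L\, \mathbf{E}_X\big[\|\hat{\eta}_{S'}(X) - \eta(X)\|_1\big] \]
\[ \le 2\beta n^2 \sup_{h \in \mathcal{H}_{\hat{\eta}_{S'}}} \big\| \mathrm{conf}(h, D_{S''}) - \mathrm{conf}(h, D) \big\|_\infty + 2L\, \mathbf{E}_X\big[\|\hat{\eta}_{S'}(X) - \eta(X)\|_1\big] , \tag{13} \]

where the first and third inequalities in the above are due to Hölder's inequality, the second inequality is due to Equation [11], and the last inequality is due to Equation [12].

Applying Lemma [6] to Equation [13], the proof is complete. □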
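The proof above relies on two elementary norm facts: Hölder's inequality $|\langle A, C \rangle| \le \|A\|_\infty \|C\|_1$ with $\|C\|_1 = 1$ for any confusion matrix, and the entrywise comparison $\|A\|_1 \le n^2 \|A\|_\infty$. A quick numerical sketch with randomly generated, purely illustrative matrices (all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4

# A confusion matrix has non-negative entries summing to 1, so its entrywise l1 norm is 1.
C = rng.dirichlet(np.ones(n * n)).reshape(n, n)
G1 = rng.normal(size=(n, n))                 # stand-in for the population gradient
G2 = rng.normal(size=(n, n))                 # stand-in for the sample-based gradient
A = G1 - G2

inner = abs(np.sum(A * C))                   # |<G1 - G2, C>|
l1_norm = np.abs(A).sum()                    # entrywise l1 norm
linf_norm = np.abs(A).max()                  # entrywise l_infinity norm

print(f"|<A, C>| = {inner:.4f} <= ||A||_inf = {linf_norm:.4f}")
print(f"||A||_1 = {l1_norm:.4f} <= n^2 ||A||_inf = {n * n * linf_norm:.4f}")
```

The second comparison is the source of the $n^2$ factor multiplying $\beta$ in $\epsilon_S$.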


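A.9 Proof of Theorem [12]

Theorem (Regret Bound for Sample-based Conditional Gradient Algorithm for a Larger Family of Non-decomposable Metrics). Let $S = (S', S'') \in (\mathcal{X} \times [n])^m$ be the given training sample drawn i.i.d. from distribution $D$. Let $\psi : [0,1]^{n \times n} \to \mathbb{R}_+$ be such that for any $\rho \in (0,1)$, there exists $\psi_\rho : [0,1]^{n \times n} \to \mathbb{R}_+$ which is concave over $\mathcal{C}_D$, $L_\rho$-Lipschitz w.r.t. the $\ell_1$ norm over $\mathcal{C}_D$ and $\beta_\rho$-smooth w.r.t. the $\ell_1$-norm, with

\[ \sup_{C \in \mathcal{C}_D} |\psi(C) - \psi_\rho(C)| \le \theta(\rho) , \]

for some strictly increasing function $\theta : \mathbb{R}_+ \to \mathbb{R}_+$. Let $\hat{h}^{\mathrm{FW},\rho}_S$ be the classifier learned by Algorithm [7] when applied to $\psi_\rho$ with training sample $S$ and parameters $\kappa \in \mathbb{N}$ and $\alpha \in (0,1)$. Then for any $\delta \in (0,1]$, we have with probability at least $1 - \delta$ (over draw of $S$ from $D^m$)

\[ \mathrm{regret}^\psi_D[\hat{h}^{\mathrm{FW},\rho}_S] \le 4L_\rho\, \mathbf{E}_X\big[\|\hat{\eta}_{S'}(X) - \eta(X)\|_1\big] + 4\beta_\rho n^2 C \sqrt{\frac{n^2 \log(n) \log(\alpha m) + \log(n^2/\delta)}{\alpha m}} + \frac{8\beta_\rho}{\kappa m + 2} + 2\theta(\rho) , \]

where $C > 0$ is a distribution-independent constant.

Proof. From Theorem [10] we have that

\[ \mathrm{regret}^{\psi_\rho}_D[\hat{h}^{\mathrm{FW},\rho}_S] \le 4L_\rho\, \mathbf{E}_X\big[\|\hat{\eta}_{S'}(X) - \eta(X)\|_1\big] + 4\beta_\rho n^2 C \sqrt{\frac{n^2 \log(n) \log(\alpha m) + \log(n^2/\delta)}{\alpha m}} + \frac{8\beta_\rho}{\kappa m + 2} . \]

For simplicity assume that the maximizer of $\psi(\mathrm{conf}(h, D))$ over $h : \mathcal{X} \to \Delta_n$ exists. Let $h^* \in \operatorname{argmax}_{h : \mathcal{X} \to \Delta_n} \psi(\mathrm{conf}(h, D))$. We then have that

\[ \mathrm{regret}^\psi_D[\hat{h}^{\mathrm{FW},\rho}_S] = \sup_{h : \mathcal{X} \to \Delta_n} \psi(\mathrm{conf}(h, D)) - \psi(\mathrm{conf}(\hat{h}^{\mathrm{FW},\rho}_S, D)) \]
\[ = \psi(\mathrm{conf}(h^*, D)) - \psi(\mathrm{conf}(\hat{h}^{\mathrm{FW},\rho}_S, D)) \]
\[ \le \psi_\rho(\mathrm{conf}(h^*, D)) - \psi_\rho(\mathrm{conf}(\hat{h}^{\mathrm{FW},\rho}_S, D)) + 2\theta(\rho) \]
\[ \le \max_{h : \mathcal{X} \to \Delta_n} \psi_\rho(\mathrm{conf}(h, D)) - \psi_\rho(\mathrm{conf}(\hat{h}^{\mathrm{FW},\rho}_S, D)) + 2\theta(\rho) \]
\[ = \mathrm{regret}^{\psi_\rho}_D[\hat{h}^{\mathrm{FW},\rho}_S] + 2\theta(\rho) . \]

Combining this with the bound from Theorem [10] above completes the proof.

□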


B Details of Calculations for Smoothed Performance Metrics in Table [2]


We now give details of the derivation of the function $\theta$ (defined in Theorem [12]), the Lipschitz constant $L_\rho$, and the smoothness parameter $\beta_\rho$ for the smoothed performance metric $\psi_\rho$. In each case, we make use of the fact that the Lipschitz constant can be obtained by bounding the maximum absolute entry ($\ell_\infty$ norm) of the gradient of $\psi_\rho$, and the smoothness parameter is obtained by bounding the maximum absolute entry ($\ell_\infty$ norm) of its Hessian.
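As a rough numerical companion to this recipe, the sketch below evaluates a smoothed H-mean and estimates the largest absolute gradient entry by finite differences. It is illustrative only: the smoothed form used in `h_mean` (adding $\rho$ to each row sum and each diagonal entry) is an assumption inferred from the gradient expressions in this appendix, and the confusion matrix `C` and helper names are hypothetical.

```python
import numpy as np

def h_mean(C, rho=0.0):
    """(Smoothed) H-mean, assumed form: n / sum_y (row_sum_y + rho) / (C[y, y] + rho)."""
    n = C.shape[0]
    ratios = (C.sum(axis=1) + rho) / (np.diag(C) + rho)
    return n / ratios.sum()

def max_abs_gradient(f, C, step=1e-6):
    """Largest absolute partial derivative of f at C, estimated by central differences."""
    best = 0.0
    for idx in np.ndindex(*C.shape):
        E = np.zeros_like(C)
        E[idx] = step
        best = max(best, abs(f(C + E) - f(C - E)) / (2 * step))
    return best

# Hypothetical 3-class confusion matrix (entries sum to 1).
C = np.array([[0.30, 0.05, 0.05],
              [0.04, 0.20, 0.06],
              [0.05, 0.05, 0.20]])

rhos = [1e-1, 1e-2, 1e-3]
gaps = [abs(h_mean(C, r) - h_mean(C)) for r in rhos]                      # approximation error
grad_norms = [max_abs_gradient(lambda M: h_mean(M, r), C) for r in rhos]  # local l_inf gradient norm

for r, gap, g in zip(rhos, gaps, grad_norms):
    print(f"rho = {r:g}: |psi_rho - psi| = {gap:.5f}, max |gradient entry| = {g:.3f}")
```

The approximation error shrinks as $\rho$ decreases, while the worst-case Lipschitz and smoothness constants over $\mathcal{C}_D$ grow, which is the trade-off that the function $\theta$ and the constants $L_\rho$, $\beta_\rho$ quantify.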

H-mean. For the H-mean, t/> h (C) = n ( i —^—— J is ^—Lipschitz over Cp. Hence we 

have 6(p) = -^—p- The gradient of is given by: 


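\[
\nabla_{C_{u,u'}} \psi^H_\rho(C) =
\begin{cases}
\dfrac{n \sum_{\hat{y} \ne u} C_{u,\hat{y}}}{(C_{u,u} + \rho)^2} \Big( \sum_{y=1}^n \dfrac{\sum_{\hat{y}=1}^n C_{y,\hat{y}} + \rho}{C_{y,y} + \rho} \Big)^{-2} & \text{if } u = u' \\[2ex]
\dfrac{-n}{C_{u,u} + \rho} \Big( \sum_{y=1}^n \dfrac{\sum_{\hat{y}=1}^n C_{y,\hat{y}} + \rho}{C_{y,y} + \rho} \Big)^{-2} & \text{otherwise.}
\end{cases}
\]

The Lipschitz constant $L_\rho$ for $\psi^H_\rho$ is then given by a bound on the $\ell_\infty$ norm of the above gradient.

\[ \|\nabla \psi^H_\rho(C)\|_\infty \le n \max_{u \in [n]} \frac{\sum_{\hat{y}=1}^n C_{u,\hat{y}} + \rho}{(C_{u,u} + \rho)^2} \Big( \sum_{y=1}^n \frac{\sum_{\hat{y}=1}^n C_{y,\hat{y}} + \rho}{C_{y,y} + \rho} \Big)^{-2} \]
\[ \le n \max_{u \in [n]} \frac{\sum_{\hat{y}=1}^n C_{u,\hat{y}} + \rho}{(C_{u,u} + \rho)^2} \Big( \frac{\sum_{\hat{y}=1}^n C_{u,\hat{y}} + \rho}{C_{u,u} + \rho} \Big)^{-2} \]
\[ = n \max_{u \in [n]} \frac{1}{\sum_{\hat{y}=1}^n C_{u,\hat{y}} + \rho} \]
\[ \le \frac{n}{\rho} . \]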


Next, we calculate the smoothness parameter $\beta_\rho$ of $\psi^H_\rho$ by computing the Hessian of $\psi^H_\rho$.


H. 


C) = 


-2nJ2 
{C u °u+p) 2 


y^u Cu,y 


y r 


c y,y + p 

Cy,y+p 


E n 

y =1 


e?= 1 c »,»+' 


Cu,u~\- P 

T~ 


- 1 


c y,y+p 


—2 n 
, n +P 




•spn *-‘y =1 ~y>y 1 ^ ] ^y^u U u,y , 

Z^y=l c y ,y+p J (C u ,u+P) 2 


2n 

Tc^TpF 


E n 

y 

E n 

y =1 


Cy,y+p 

+ P 


v 171 r* — 
^y=i °y,y 

C y,y+P 


Yly^u C u ,y 


—2 n 


y r 


V 2 1 c 

E n z - , y =1 

-17=1 


E? 


=i c y,y + P Y 


C y,y+P 
:r + P ' 


c y,y+p 


C u ,u +p 

^2y^:y ^ n , 


y-»n 

E n _“£= 

y=l 

Eg^.. c ~ ^ 


Cn,n+P 

X" 


- 1 


C y,y+P 


2n ^ yTU u,y *' ~ ->? 

"(C'n^+p)'^ (Cd^+p)^ - 


E 


£3=!°- 


y,y 


y=i 


C y,y+P 
2 n E££n_£u^_ i 
(C u , w +p)^ C?;,i;+P 


T 


E n 
n=l 


C 'y,y+^ 

o 1 ^v.y 

2n c u ,u +P { ct+J 

y^ri ^g=l + P 

Z_/ 7 /=l Qy y^-p 

l _ l 

C^n,n+P Ct),?;+P 


j y =1 
—2 n 


E n 
y — 


y — ^ Cy,y-\-p 


if u = u' = v = v' 

if U = u' = V 7 ^ v' 

if «/«' = « = «' 

\i u ^ u’ = v ^ v' 
if u = u' ^ v = v' 
\i u = u' ^ v ^ v' 

if m / u' / » = i/ 
otherwise. 


Bounding the entry of the Hessian matrix corresponding to confusion matrix entries $C_{u,u'}$ and $C_{v,v'}$ for $u' = u = v = v'$, we get


< 


2re C u ,y \ ^y= 1 


5D^=i C's/,?/ + 1 


Cy,y+P 


C u ,u +p 


+ 1 


{C U} U + P ) 2 

Eg=l C Uy y+p 


\-^n E}=i C»,w + P \ ^ 
Z^i/=1 


< 2n 


< 2 n 


< 2 n 


-i 3 


Cu,u~\~P 


E^=i Cy,j/ 

L2^ w =l 


?/—1 Cy t y+P 
n \-^n 


Cy,y+P 

Cu,u “I - P 


( X^=l + p) 


E 

- \ y =l 


Efci C y,y + p\ 1 


E 

L \j/=l 


E£=i Cy.y + P 


a, 


+ 


^y,y + P 
Cu,u H - P 


Cu,u H - P 


+ 1 


n- 


y,y “T" P 

1 + p 1 + p 


{Y2=\ C u,y + p) (EfclC'u.y + p) J 


P J 


+ 


5 


The above bound can be shown to hold for all entries for which $u = v$. We next consider the case when $u' \ne u \ne v \ne v'$; assuming w.l.o.g. that $C_{u,u} \le C_{v,v}$, we have


l V C«..,C„,^(C)l < 2n 

2 n 


Eff=i Cu,s? + p 
C u ,u +p 

y-m Eg=i Cy,? + P 

~ ^— j y = 1 Cy,y-\~p 


1 


(Ej=i c«,2/ + p)" 


< 3 


where the same bound holds for all Hessian entries corresponding to $u \ne v$. The smoothness parameter in Table [2] then follows from the above bounds.


Q-mean. For the Q-mean, ^P(C) = 1 — \j ^ ]E” =1 (1 — 


Cy ,y 

E^Tc— 


IS 




-Lipschitz over Cp. 


Hence we have 0(p) = -p=i— p. The gradient and Hessian for ?/;? are given by: 

\/ 77.71" m in ' 


Vo...,rf(C) = 2 


(Y^yzfiu C u,y + p) 

(Ea 1 o„ iff + p)i' 


v'n 


E 




| ^y^y c y.y + p 
\ ^y=i c y,y + p 

(J2yjt u C u,y + p) g u,u 

(£g =1 o„, ff + rt a 


v^n ( ^y^y c y,y + p 
^ y=1 \T,$ =1 C y,y+P 


if u = u' 


otherwise 


We next calculate the Lipschitz constant $L_\rho$ for $\psi^Q_\rho$ by bounding the $\ell_\infty$ norm of its gradient.


HVV’p (C)||oo < max 


1 


E£=l C Ul y+P 


u S [n] y/n 


E n 

v= l 


Eff=i Cy,y ' 


1 


1 


= max 


Sjj/y ^y,y + P 

Cu,y+P 


u S [ra] C y fi + p 


1 1 

< max —-—- 

ue[n] y/n 2^y^y Cy,y + P 

1 

< 


Ey^y Cy,y + P 
y~ l VE g =l Cy,y+ P 


E n 

y= l 


Vnp' 

We next calculate the Hessian and bound its norm. 

Vh .c) = 


y/n 


E»^S C «,v' 1 '^ (oT^n ( C y ,y + p 

(£g = iQ„, ff + P) a 


c “,y + p 


+ 1 


2x 3/2 


(Sff =1 Wy 


// ^Wy c y,v + p N \ 

l Ey=1 Uff=i c v,s ? + 7 ) 

| V" | ^-y^y c y,y + p I - c y,y ~ p + 2Cu , u , r< 

+G “ 


\/n 


E 


/ ^W.y c .y.y + p 


2\ 3/2 


V v^n 

2j "= i V s -:; i r -y - -/ 


H 1 V ^--y=i c y.y + p 
/ ^-y^y c v,v + pN \ ^y#y c y,y +p ~ 2 Cl y>y 


-E 


P-'ii.h I 


y^u '-'u.y 


v/rT 


E n 

y =1 


2\ 3/2 


W 71 P7~ 

^y=l 2/>y 
2 


+ P 


Cu,u f /" '^- j y^y ^y,y P A ^(^y^u ^u^y ^-'u,v 

l C »,i + P ) 3 \^y =1 VE^lj C y,y + p) £ g= l C u ^ + P 


"t Ey^u Et,y P 


Kg-^H g M,g- + P ) 2 Ky^y C t;,g + p ) 2 

+■ P ) 3 

2\ 3/2 


/v^y- / ^y#y c y,y + p ^\ 'i 

V^ =1 V^=i c y,y+E ; 


2\ 3/2 


1 ^p=i g «,a C v,y ~ *~ 

\/n 


Y^n ( ^y#y g y,y + p 

. 1,-1 \ ^ g =l C y,y+P 

,y p)Cu,u (Ey^y C v y + p) 

(SS = iC„ |ff + p)3 - ' 




2\ 3/2 


/Y^ n / ^y^y c y.y + p ^\ 'i 
V Ea=1 \^i c v,v+») ; 

(Yly^u ^U,y ( Yly^y ^~'y^y^~ P)^V,V 


1 (EgliO^ + P) 3 (E 


'ff=i E.y + p) 3 


2\ 3/2 


V Ev=1 Ul?=i c y.y + E ; 

(Ej/^n c u,y + P) g y,y Q2yjt y c v,y + P) g y,y 




(Eg =1 g», g W" 


(Sg=i°>,g+»J r 


y^Ti ^ ( ^y^y C y,y 


+ P 


2 \ 3/2 


Vsp =1 


+ P 


if u / = u = v = v' 

if u' = u = v ^ v 1 

if u' ^ u = v = v' 

\ivl^u = v^v' 

if u' = u ^ v = v' 

if w 7 / u 7 ^ v = v' 

if u' = u v v' 

otherwise. 


We now obtain the smoothness parameter $\beta_\rho$ for $\psi^Q_\rho$. We start by bounding the entry of the Hessian matrix corresponding to confusion matrix entries $C_{u,u'}$ and $C_{v,v'}$ where $u' = u = v = v'$; the same bound can be shown to hold for all entries for which $u = v$.

i v Ea,.,V’?(c)| 
Ey^u C-u^y + P \ 2 n 3/2

ES=iC^ + pJ 


V^n fJ2y 7 iy C y,y + P\ 

l ^y=^\T^c^+-p) J 


Es=i + P 




Thy^y Cy,y + P 


L ?/=l 


“f \ E£=i C?/,y + P / Eff=i C'u.y + P 


+ 1 


< 


< 


< 


< 




Yly+y Cy,y + P 


V™ YZjj=l C u ,y + P y=1 \Z]|=1 Cy$ + P) Y%j= 1 C u ,y + P 


+ 1 


V™ Efcl C u ,y + P 


3 n 


E£=l C u ,y + P 


+ 1 


1 1 


/n p 
4 y'n 


3n 


+ 1 


We next bound the entry of the Hessian matrix corresponding to $u' \ne u \ne v \ne v'$; the same bound can be shown to hold for all entries where $u \ne v$. Assuming w.l.o.g. that $\sum_{\hat{y} \ne u} C_{u,\hat{y}} \ge \sum_{\hat{y} \ne v} C_{v,\hat{y}}$,


IV 


2 

P'UU 1 


,^(C)\ < — 


T,yjiuCu,y+ p \ 2 -| 3/2 


jy =1 C u ,y H - P 




E n 

y= l 


+ p 


Py,y 

Efci Cy,y + P ) 


V 




a 


+ P (52$=1 C v ,y + p) 3 


< 


< 


1 

7^ E 
1 


c, 


u,u 


a 


v,v 


y^u Cu,y 

Ciiii 


P Q2U c *s 

1 


p) 3 


\fn £ 
i i 

3 


y^u Cu,y + P (£?/=l Cv,y + PY 




n p J 


The smoothness parameter $\beta_\rho$ in Table [2] then follows from the above bounds.


G-mean. For the G-mean performance metric, $\psi^G(C) = \Big( \prod_{y=1}^n \dfrac{C_{y,y}}{\sum_{\hat{y}=1}^n C_{y,\hat{y}}} \Big)^{1/n}$ is not Lipschitz over $\mathcal{C}_D$. We now explicitly derive the form of $\theta$ for this performance metric.
<


n 


c, 


y,y 


l/n 


+ 2 p 


l/n _ 


ii v n r - 

~y =1 ^2/=l 

$= 2\rho^{1/n}$,

which gives us $\theta(\rho) \le 2\rho^{1/n}$. Next, we provide the gradient of $\psi^G_\rho$.


VcM’?(C) = 




n v" <7 

«=i ^y=i 


l/n 


/" C u ,y-\~P \ 

1 71 1 Ej?^U C’u.J y-,- , 

^ Cy,y+p \ 

\E |?=1 C u,y+Pj 

(E~ =1 C„ i5 +p)^ ily^n 1 

lE^=i C y,y+P) 


( Cu,y+p \ 

1 71 1 TT i 

f Cy,y+P \ 

\Ey=l C u ,y+p ) 

(Ej=i c u . 5 +p)^ 1 

\ Ey = l Cy,y+P ) 


l/n 

l/n 


if u = u! 
otherwise 


The Lipschitz constant for $\psi^G_\rho$ is then obtained by bounding the $\ell_\infty$ norm of the above gradient.

I|V^ G (C)|U 


ue 


< max 
u e [nl 


= max 


!( 

^ a. 

,u + P 

n \ 

sr^n 

\2-^y=l 

Cu,y “1“ P 

l i 

' C- U; 

,u + p 

n \ 

sr-^n 

\2-yy=l 

^u,y “1“ P 

l i 

' c u . 

,u + P 

n \ 

sr^n 

\2^y=l 

C'u,y “1“ P 


i-i 


max 


Cu,yi C u ,u} 


(Ej/=1 C u ,y + pY 
T_1 EsLl Cu,y + P 


1 


(E£=l c u ,y+ P y 

1 

EEl Cn,y + P 


< 


1 


n \ 1 + p 

lp + P 

n I p 


L-l 


1 -- 


The Hessian for $\psi^G_\rho$ takes the form:
72 _/.G 

t'-H = 

1-2 


VL,C„>?( C) = 


uu' ’> vv' 


- 11-1 


11 1 — - 


Cu,U~\~P 

Eff=l Cu,y+P 


\ 2 -TT / Cy : y+P 

(i:~ =1 C U ,y+py) lly^U {^ =1 C y ,y+P 


l/n 


if u' = u = v = v' 


Cu,u+P " 2 Yly^tu C-u.,y -p-f f Cy t y + p \ 

Eij?=l C U: y+p) (I2$=lCu,y+P) 4 lly^ u vEg^l Cy.y+P/ 

if u' = u = v ^ v' or u' ^ u = v = v' or v! ^ u = v ^ v' 


Cy y ~j ~p \ >7 ^ Eg^u fix,y _ Eg^p Cy,y T“r / C y , y +p ^ 


1-1 


1 / i / ^v,v \ p' \ ■" ~ “ii/ ^— 'yj=v u iy | | / v -' y,y i z-' 

^lEg=i<E,g+pJ Ie?E^E+^ (Eg=i c^.v+p)* (£2=1 c„,g+p)^ ^ 11 ^ ^ Eff=l Cy,y+P 


1 

n 2 


1 


C u ,u~\~P n 


l —l 


Cv,v~\~P \ n 


if u’ = u 7^ v = v ! 

/ lyy-n Cu : y C v 


Cy,y+P 


^u,u i p> i i i ,u z—syytu “,y _ ^ v,v _ i I / oy,^ i p> 

Eg=l C u,y+p) \Eg=l Cv,y+p) (Eg=l C u,y+p) 2 (Eg=l C^g+p) 2 ^ V Eg=l Cy,y+P 

if a' = m / » / »' 


_1_ 

Gu/ugTp \ n 


G*„,„+p \ n 1 


C,,. 


Ey/,; Nm, 


Cy,y+P 


v -"u,'u i / ^VjV i A" \ y - y u,u zsy^zv u ! y i i / ^y,y 1 r 

E^iC^ff+p,/ IEsUC-mtW (E^=i d, s +P)^ '(E^i y# 11 ^ l E?=i 

if u' ^ u ^ v = v 1 


l/r 


1/r 
( Cu,U~\~P \ n f Cv,V~\~P 71 Cu,U Cv,v 1 \

c», v +p) \±% =1 c\ v +p) (E^i c Ut9 +p)» (Eg =1 c VtS +p )»^ 1 

otherwise. 


Cy^y+E 

Es=i ^i/.s+p 


1/r 


The smoothness parameter $\beta_\rho$ is then given by the following bound on the $\ell_\infty$ norm of the Hessian:

II V72„/,GII 


l|V 2 ^(C)||oo 
< max -rr | 


/ C U)U + p \ 1 / + p \ max{C u , u , <?„,{?} max{C v , Eg*, 

^E~=i C u ,y + p) \Y$=i c v$ + p) d^=i c u,y + p ) 2 (E~=i C v ,y + P y 


u,v e [n] n 2 V Eg=i C u $ + p 

^ 1 / C uu + p 

< max — =——-— 

ne[n] U 2 VLy=l Cu,y + P 


E n 

y =1 °v,2/ 

HV E?=i^+p 


(Eg=i^ + p) 5 


1 

= max —r 


Cu,u T p 


--2 


we [n] n 2 VEgLl C u ,y + P/ \X^=1 Cw,j/ + P 

, v 2._o 

1 / - N - 


n 2 V 1 + p 


44 ^) 4 

n \ p J p 